Sunday, 20 November 2011

Controlling the Web


Google lawyers explain how the financial blockade of Wikileaks could apply to anyone under SOPA.

Documents obtained by Wall St Journal open window into a new global market for off-the-shelf surveillance technology

UK Internet Censors Blacklist Fileserve File-Hosting Service

HP Computers Underpin Syrian Surveillance

When you request your personal data, Facebook generously keeps most of it so it doesn't overwhelm you.

The Internet gold rush: why your data is valuable.

A secure Internet can save people's lives.

The Internet can recognise your face.

How does Twitter choose trending topics?

Who controls the Internet?

Who controls Asian cyberspace?

Introduction to Parallel Programming and MapReduce

This may, for some of you, be helpful background for the lecture on Indexing the Web.

Google Code: Introduction to Parallel Programming and MapReduce
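
As a rough illustration of how the MapReduce idea connects to indexing, here is a minimal single-process sketch that builds an inverted index (word to list of documents). The function names (map_phase, shuffle, reduce_phase) and the two tiny documents are assumptions made up for this example; a real MapReduce system would spread each phase over many machines.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map step: emit one (word, doc_id) pair per word in a document."""
    return [(word.lower(), doc_id) for word in text.split()]

def shuffle(pairs):
    """Shuffle step: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for word, doc_id in pairs:
        grouped[word].append(doc_id)
    return grouped

def reduce_phase(word, doc_ids):
    """Reduce step: collapse the values for one key into a sorted, de-duplicated list."""
    return sorted(set(doc_ids))

# Hypothetical two-document "web" used only for illustration.
docs = {1: "the web is big", 2: "index the web"}

pairs = [pair for doc_id, text in docs.items() for pair in map_phase(doc_id, text)]
index = {word: reduce_phase(word, ids) for word, ids in shuffle(pairs).items()}

print(index["web"])    # documents containing "web"
print(index["index"])  # documents containing "index"
```

The map and reduce steps never share state except through the (key, value) pairs, which is what would allow them to run in parallel.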

Wednesday, 16 November 2011

Eliza lab

Today you'll experiment with using regular expressions to mimic Eliza's responses. Using this tool, type in search patterns to match the kinds of sentences that people often type as input to Eliza. Then in the replacement pattern box, write a pattern to produce an Eliza-like response. For example, the search pattern
  • I am (\b\w+\b)
with the replacement pattern
  • How long have you been $1?
will match the sentence
  • I am bored
typed in the input text box and produce
  • How long have you been bored?
in the output text box.
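
If you would like to try the same search and replace outside the tool, the following is a minimal sketch using Python's re module. It assumes the tool replaces only the matched part of the input; note that Python writes the captured group as \1 where the tool writes $1. The function name respond is made up for this example.

```python
import re

# Same search pattern as above; the parentheses capture the word after "I am".
SEARCH = r"I am (\b\w+\b)"
# Python uses \1 for the first captured group (the lab tool uses $1).
REPLACE = r"How long have you been \1?"

def respond(sentence):
    """Return an Eliza-like reply by substituting the matched text."""
    return re.sub(SEARCH, REPLACE, sentence)

for line in ["I am bored", "I am sick"]:
    print(line, "->", respond(line))
```

In Python, any text outside the match is left in place, so respond("Today I am bored") gives "Today How long have you been bored?". It is worth checking whether the lab tool behaves the same way.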
So, now you try it. Write a regular expression search pattern to match sentences of the form:
  • I am sick
  • I am tired
  • I am hungry
Does your search pattern match the following strings and produce the kind of response you would expect?
  • Today I am bored
  • I feel very happy
  • I feel awfully tired
If not, fix your search pattern so that the system generates sensible outputs for each of these. Now try these
  • I feel cold and I want a nap
  • I feel cold and I want a hot drink
  • I like food and drink
  • I like to sing and dance
What if you wanted to generate the response:
  • Have you always enjoyed eating and drinking?
in response to the input:
  • I like to eat and drink
Would your search pattern also work for
  • I like to sing and dance
Now write a search pattern that will match sentences like:
  • I am tired and hungry
and produce the response:
  • What do you think makes you hungry and tired?
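
One possible approach, sketched here in Python rather than in the lab tool, is to capture the two words in separate groups and refer to them in reverse order in the replacement. This is only one of several patterns that would work; in the tool the groups would be written $2 and $1 instead of \2 and \1. The function name respond is made up for this example.

```python
import re

# Two capture groups: the word before "and" and the word after it.
SEARCH = r"I am (\w+) and (\w+)"
# The replacement uses the groups in swapped order.
REPLACE = r"What do you think makes you \2 and \1?"

def respond(sentence):
    """Reply to 'I am X and Y' with the two captured words swapped."""
    return re.sub(SEARCH, REPLACE, sentence)

print(respond("I am tired and hungry"))
```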
Now play with this version of Eliza. Find cases where it doesn't respond very naturally and try to write search and replacement patterns that solve the problems you encounter. (Note: they may not all be solvable just by this simple search and replace method.)

Monday, 14 November 2011

More links on SOPA and Protect-IP

SOPA stands for the Stop Online Piracy Act. It is one of two bills being considered by the US Congress. The other is the Protect-IP Act.

Note that in a topical and contentious area such as this, Wikipedia may not reflect a settled consensus. It is useful, and sometimes illuminating, to look at the page history to see the "edit wars" that erupt.

Friday, 11 November 2011

Turmoil on the net ...

Twitter stories:

  1. How the US Justice Department legally hacked my Twitter account
  2. Twitter's privacy policy and the Wikileaks case
  3. Who gets custody of Twitter a/c when an employee quits?
  4. CIA tracks tweets

And more ...

What is the Social Graph?

Iran says it is fighting Duqu computer virus, similar to one aimed at nuclear site last year.

Zetas Drug Cartel Reportedly Murders Internet Chat Room Users

Online Bullying Really Not That Common

Google cars use massive amounts of data

Footage from test runs of Google driverless cars. Shows what you can achieve once you have the data.

Policy Issues : some possible leads ...

Please find below a collection of links I have collected over the past couple of months, loosely grouped. I hope these may provide food for thought as you prepare your essay for your final assignment.

These links are provided 'as is'. I make no guarantees that the information provided is reliable (it is up to you to judge that). You should not assume that my linking implies endorsement: I may (privately) endorse or condemn the views espoused.

Bill Gates Shaping the Internet is still available for comment on NB.

What are governments (and others) doing?

Neelie Kroes speech at EU Hackathon: on transparency and blocking (video)

"The future of the internet is too important to be left to chance" Foreign Secretary William Hague reflects on the London Conference on Cyberspace

Background Briefing from the US State Department on Hillary Clinton's Participation In the London Conference on Cyberspace

EU Politician Wants Internet Surveillance Built In to Every Operating System

The U.S. is seeking detailed info on the trade impact of Chinese policies that block U.S. companies' websites in China

Google asked to remove 135 YouTube videos for 'UK national security issues'

NSA whistleblower Thomas Drake details intelligence cock-ups - 'Government and companies routinely abuse data privacy'

Who is spying on whom? A crowd-sourced Wiki for information on 'the intelligence contracting industry'.

Technological fixes?

Facebook's massive cyber-security system. Discovering what Facebook knows about you.

FBI official calls for secure, alternate net. Telecomix DNS is going next-level with its own decentralized infrastructure to replace the hierarchical DNS currently in place.

Google's advice on on-line safety.

Researchers developing cyber security software to wipe mobile data based on location

How could regulation make a difference?

Judge rules BT must pay for software that will block access to a site that links to free movies.

The worst thing about censorship is what you don't see

Unfortunately, "Google and the Culture of Search" will not be published until 2012. However, you may want to include your thoughts on "search technology’s broader implications for knowledge production and social relations" in your essay. Those without time machines will have to make do with other sources, and the publisher's blurb (follow the link for this), for inspiration:

Can an algorithm be wrong?

Better the devil you know ... What happens if the banks shut down wikileaks?

PROTECT IP Act would break DNS

Law professors say, "PROTECT IP Act is unconstitutional"

EFF on the PROTECT-IP act


Some recent news stories on copyright and IP.
Will Hollywood Break the Internet?
Copyright bill is the 'end of the Internet'.
Warner Bros issued takedowns for files they never saw—and didn't own.


Will Patents kill the internet?

Hacking pro bono and otherwise

Online hackers threaten to expose drug cartel's secrets.
Denial of service: Wikipedia Italy went on strike against an 'idiotic proposed law' (lang=it : use Google translate).
Video: Attack of the Hactivists

The unmanned aircraft drones that the USA uses to kill people in other nations have been hacked.

State-sponsored hacking in Germany

Hackers used a Trojan horse to break into the systems of more than 50 companies, many of them in the chemical and defense sectors. Symantec traces one command-and-control server to China.

How things go wrong

Programmer's error when creating a regular expression leads to calls for a criminal investigation

... the blame can be laid on a poorly-crafted regular expression. In computer science terms, regular expressions (often abbreviated as "regex") are used for complicated forms of text matching and substitution. They rank among the highest forms of programming arcana, primarily because of their flexibility, but are also some of the most prone to bugs.

Crowd-sourcing crime: Crowd-sourcing began as a legitimate tool to leverage the wisdom of the crowds; the same techniques are increasingly being adopted by the criminal underground for nefarious purposes. Video: The Business of Illegal Data
Feds Indict 7 in massive click-fraud scheme that hit 4 Million Computers

Stuxnet Clone 'Duqu' Possibly Preparing Power Plant Attacks. Windows kernel 'zero-day' found in Duqu attack

A week in Internet censorship: Thailand, Sri Lanka, Egypt, and a suit against Amesys for aiding Libyan surveillance.
U.S. Firm, Blue Coat, provides the technology that Syria uses to Block the Web. Blue Coat told investors its reputation could be harmed if foreign gov clients used tech to violate human rights. Washington Post on this Story

Chinese state media says three people have been arrested for "spreading false rumours" online, warning authorities will quash all such activity.

How WikiLeaks complicated the lives of Belarusian dissidents.
Researchers uncover privacy flaws that can reveal users' identities, locations and digital files. Someone hacked Israel's biometric database. Now 9 million people's personal info is on the loose.

Wednesday, 9 November 2011

Today you will explore the use of regular expressions to scrape information from a web page. You will use the regular expression sandpit. Read the instructions on the sandpit, and experiment to make sure you understand how it works.

The regex crib sheet may be helpful, for reference.

The regular expression sandpit lets you enter a regular expression, and then tag the strings that match it with HTML markup. For example, here is some text scraped from
The first part of this exercise uses this text.

8 tbsp olive oil
450 g cooked peeled prawns, (ideally freshly shelled), drained and patted dry
1 medium onion, finely chopped
4 stick celery
2 cloves garlic, finely chopped
1 red chilli, de-seeded and finely chopped
1 bunch of fresh mint, leaves chopped
100 g cherry tomatoes, halved
juice of 1 lemons
4 courgettes, grated
150 ml white wine
black pepper
500 g fresh tagliarini
200 ml fresh vegetable stock
25 g unsalted butter
40 g parmesan, freshly grated
Heat the olive oil in a large, heavy-based frying pan. 
Add the prawns and fry until lightly browned. 
Add the onion, celery, garlic, chilli and mint. Fry until the prawns turn nutty brown. 
Add the cherry tomatoes and cook for 2 minutes. 
Stir in the lemon juice, then the courgettes. Cook for 5 minutes. 
Pour in the white wine, bring to the boil and cook for 2 minutes. Remove from direct heat. 
Bring a large pan of salted water to the boil. Add the fresh tagliarini and cook until al dente, around 2 minutes. Strain. 
Add the strained tagliarini to the prawn mixture, followed by the vegetable stock, butter and Parmesan cheese. 
Mix well together over a medium heat with a wooden spoon until glossy. Season with salt and freshly ground pepper and serve at once.

Most recipes on this site have a similar format.

Paste the text into the regular expression sandpit.

Task 1

First, put the regular expression [0-9].* into the regexp box. Then put li into the tag box. This shows you the result of enclosing each match for the regular expression between tags <li> and </li>.

This almost lists all the ingredients, but not quite. Try using the following regular expressions, one by one. Try to understand what is happening.

  • [0-9].*
  • \n[0-9].*
  • \n([0-9]|juice).*
  • \n([0-9]|juice|black).*
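
To see the effect of each of these expressions outside the sandpit, here is a minimal Python sketch that lists what each one matches. The shortened SAMPLE text and the helper name matches are assumptions for illustration; the sample is given a leading newline so that the first ingredient can be matched by the patterns that start with \n.

```python
import re

# A shortened, hypothetical version of the recipe text, with a leading newline.
SAMPLE = (
    "\n8 tbsp olive oil"
    "\n450 g cooked peeled prawns"
    "\njuice of 1 lemons"
    "\nblack pepper"
    "\nAdd the cherry tomatoes and cook for 2 minutes."
)

PATTERNS = [
    r"[0-9].*",
    r"\n[0-9].*",
    r"\n([0-9]|juice).*",
    r"\n([0-9]|juice|black).*",
]

def matches(pattern, text):
    """Return the full text of every match (group 0), not just the bracketed group."""
    return [m.group(0) for m in re.finditer(pattern, text)]

for pattern in PATTERNS:
    print(pattern)
    for found in matches(pattern, SAMPLE):
        # Strip the leading newline so each match prints on one line.
        print("   ", found.strip())
```

The first pattern also picks up "1 lemons" and "2 minutes." because it accepts a digit anywhere in a line; requiring \n before the digit restricts matches to lines that begin with a number.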

Once you are satisfied that you have successfully picked out all the ingredients, click tag text.

To finish the markup of the ingredients, try the regular expression (<li>\n.*)*</li> first with b as the tag, and then ul. Again click tag text.
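
The two tagging steps can be imitated in Python as a rough sketch. The helper tag_matches is an assumption about what the sandpit does (wrap each match in an opening and closing tag), and TEXT is a shortened, hypothetical version of the recipe.

```python
import re

def tag_matches(pattern, tag, text):
    """Wrap every match of pattern in <tag>...</tag> (assumed sandpit behaviour)."""
    return re.sub(pattern, lambda m: f"<{tag}>{m.group(0)}</{tag}>", text)

# Shortened, hypothetical recipe text with a leading newline.
TEXT = "\n8 tbsp olive oil\n450 g prawns\nblack pepper\nHeat the olive oil."

# Step 1: tag each ingredient line as a list item.
items = tag_matches(r"\n([0-9]|juice|black).*", "li", TEXT)

# Step 2: the run of consecutive <li>...</li> items is one match; wrap it in <ul>.
listed = tag_matches(r"(<li>\n.*)*</li>", "ul", items)

print(listed)
```

The second expression works because each tagged item starts with <li> followed by a newline, so the repeated group can step from one item to the next until the final </li>.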

Now use similar methods to tag each step in the method with li and the whole sequence of steps with ol.

Complete the markup by tagging the words "Ingredients" and "Method" with the tag h2.

Task 2

Now find a recipe online (choose one that is well-structured). Scrape the text using copy and paste and use appropriate regular expressions and tags to mark it up in HTML.

Task 3

Find the Amazon UK page listing Digital Cameras. Use View Source in your browser to examine the HTML for this page. How are the prices tagged?

Copy the text from the web page (not from the View Source page) and paste it into the regular expression sandpit to replace the Jabberwocky poem.

Can you write a regular expression that selects all the prices on this page and nothing else?
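
As a starting point, here is a minimal Python sketch of one possible price pattern. It assumes prices are written with a pound sign, optional comma-separated thousands and exactly two decimal places; the LISTING text and the product names in it are made up, so the pattern may need adjusting for the real page.

```python
import re

# Pound sign, digits, optional ",ddd" groups, then a decimal point and two digits.
PRICE = re.compile(r"£\d+(?:,\d{3})*\.\d{2}")

# Hypothetical listing text, not copied from any real page.
LISTING = (
    "Examplecam X100 14.1 MP 8x Optical Zoom 2.7 inch LCD £129.99\n"
    "Examplecam Z9 16 MP 5x Optical Zoom 3.0 inch LCD £1,049.00"
)

print(PRICE.findall(LISTING))
```

Requiring the pound sign is what stops the pattern from also selecting numbers such as 14.1 or 2.7.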

The Amazon page also has model numbers, details of resolution (megapixels), zoom factor (e.g. 8x Optical) and screen size (e.g. 2.7 inch), but these are not tagged specially in the page source.

To tag them you would first have to find them. Use the regular expression sandpit to find a regular expression for each data type that matches all occurrences of that data type.
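
A rough Python sketch of the kind of patterns that might work for three of these data types is given below. The patterns, the dictionary name SPEC_PATTERNS and the LISTING text are all assumptions for illustration; model numbers are left out because their format varies too much between manufacturers to guess.

```python
import re

# One candidate pattern per data type; each allows an optional decimal part.
SPEC_PATTERNS = {
    "megapixels": r"\d+(?:\.\d+)?\s*MP\b",
    "zoom": r"\d+(?:\.\d+)?x Optical",
    "screen": r"\d+(?:\.\d+)?\s*inch",
}

# Hypothetical listing text, not copied from any real page.
LISTING = (
    "Examplecam X100 14.1 MP 8x Optical Zoom 2.7 inch LCD\n"
    "Examplecam Z9 16 MP 5x Optical Zoom 3.0 inch LCD"
)

found = {name: re.findall(pattern, LISTING) for name, pattern in SPEC_PATTERNS.items()}

for name, values in found.items():
    print(name, values)
```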

Task 4

For extra fun, scrape the text from this page at and tag each address in bold.

Assignment 4: Policy and technology

For your final assignment you should write an essay (2,000-3,000 words) on:

Policy and technology: can we save the internet?

The Internet started as a cooperative endeavour based on mutual trust between organisations with common goals. The open publication and flow of information that the Internet supports has been credited with many improvements, for example in education, healthcare, social inclusion, democratic accountability, and business efficiency.
However, the open nature of the internet has also provided new opportunities for criminals, bigots, repressive governments and unscrupulous corporations, to steal, bully, control and exploit. Your first essay for this course discussed some of these issues.

Your essay for this assignment should address some of the technical and regulatory measures that may be used, for good or ill, by governments, corporations and individuals — to impose and evade controls, to launch and defend against attacks. How could and should governments control the internet? Who should have the power to control what you can post and what you can see? How can such powers be exercised, for good and ill? Can they be effective without destroying the freedoms that have, arguably, stimulated global creativity and innovation so successfully?

You must include a list of sources (not included in the word count). You should quote only sparingly, and identify and attribute any direct quotations.

Links to some relevant recent articles will be posted on this blog.

Bill Gates' article (please leave some comments) at
provides some useful historical perspective – what has changed since the turn of the century?

You may also find the following links helpful for current news and views:

The hand-in deadline for this assignment is 4pm on 25th November.
You should submit your report via TurnItIn. See email of 21st September for signup details. Once you have registered, you can sign in at the following url:
The deadline for your submission for this final assignment is 1600 UTC on Friday 25th November. Unless you have good reason for not meeting this deadline, and we have agreed an extension in advance of the deadline, late work will not be marked.