After the Deluge
Data Analytics and Privacy
The last time you logged into Amazon, were you spookily impressed that they seemed to know exactly what you were looking for? How on earth did they figure out that the latest Scandinavian crime thriller was uppermost in your mind? After all, all you did last time was browse the latest DVD releases. But maybe you spent a little time reading the customer feedback comments, or clicked on further recommendations before deciding not to purchase. And didn’t you delete those pesky cookies anyway? Welcome to the world of extremely large-scale data mining, where someone, somewhere is probably keeping a very close watch on you.
It is claimed that 90% of the data in the world today has been created in the last two years [1]. This explosion in data offers new opportunities but also presents tremendous challenges. The opportunities include the generation of new services to enhance quality of life (and the opportunity to pre-order the latest Lady Gaga CD). The challenges arise from the varying sources of data, uncertainty about their provenance and finding ways of extracting information/knowledge from the data and visualising it in ways which are fit for purpose. Overlaid onto these issues are those of personal choice and privacy: Do I want them knowing all there is to know about me? Or maybe I don’t care?
To benefit from the promise of extensive machine intelligence, a brave new world of pervasive data is upon us. The capture of more and more data makes predictions more accurate and informative, enabling more accurately tailored and hence responsive services. But if we wish to live in a carefree world with such automated convenience, it will only come about through mining private data ... and in huge quantities.
Of course, data collection is one thing but information and thence knowledge and understanding are quite another. Throughout history, and latterly through scientific disciplines, mankind has sought to make sense of its surroundings. In recent times this physical surrounding has moved into the virtual or cyber world, presenting massively enlarged data spaces beyond what could be captured, processed and recorded in previous times. The cyber world now contains all sorts of data including virtual manifestations of physical reality, as well as some that are unique to cyber space.
UK government interest in the data explosion can be traced back to the late 1990s when the anticipated arrival of the Large Hadron Collider (LHC, now operational at CERN and homing in, possibly, on the Higgs boson) provided one of the motivations for the Research Councils’ e-Science programme. In the security context, the mass of data that supported the post-9/11 terrorist investigations led the newly formed US Department of Homeland Security (DHS) to sponsor the emergence of the new discipline of Visual Analytics. Most recently (late 2011) the Engineering and Physical Sciences Research Council (EPSRC) together with the UK Ministry of Defence (MoD) launched a call for proposals to address Data Intensive Systems. The call identified three core topics: (1) extracting meaningful information; (2) safe and secure cloud computing – anticipating that much of the data handling will be in the cloud; and (3) ensuring confidence in collaborative working – given that the processing of ‘big data’ will almost certainly involve cooperative working.
Visual Analytics (VA) is the science of making sense of large data sets, using interactive visualization and query, together with semantic extraction and data fusion technologies, to support the analytic reasoning process. The roadmap for research in this area was established in 2005 [2]. Activity in the US has been orchestrated through the National Visualization and Analytics Center [3], which has engaged universities through the DHS academic Centres of Excellence at Purdue University (VACCINE [4]) and Rutgers University (CCICADA [5]). Since 2009 this activity has included an international collaborative effort with a UK consortium, UKVAC [6], of which the Institute for Security Science and Technology (ISST) at Imperial College is a founding partner.
VA is the use of interactive visualizations to support the human analytic reasoning process, with tools that facilitate dynamic query, analysis, hypothesis formulation and testing, and the collation and marshalling of evidence for sense-making (colloquially referred to as ‘joining the dots’). VA also requires strong algorithms for ‘smart’ information retrieval, extraction and concept searching; new data structures to support data handling, provenance and ad hoc querying; and methods for handling missing data and uncertainty. VA provides the framework for combining data, visualization and human sense-making aspects to create integrated workspaces for analysts.
At ISST, we concentrate on the data analytics components of VA. We are dealing with data that is incomplete, sometimes unreliable and internally inconsistent; it is a mix of data that includes structured and unstructured text, still and video images, audio feeds and computer media. We are developing new algorithms for abstracting the data and analysing relationships in the data. For example, one abstraction is based on clustering and leads to algorithms for detecting sub-communities. It is then possible to study how these evolve and try to identify which external triggers cause individuals to migrate between communities.
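The clustering-based abstraction described above can be illustrated with a minimal sketch. The snippet below is not the ISST algorithm, merely one simple and well-known approach (label propagation) to detecting sub-communities in a relationship graph; the graph and node names are invented for illustration.

```python
from collections import Counter

def label_propagation(graph, max_iters=100):
    """Crude sub-community detection: each node repeatedly adopts the
    most common label among its neighbours (ties broken by max label,
    so the run is deterministic). graph maps node -> set of neighbours."""
    labels = {node: node for node in graph}  # start with unique labels
    for _ in range(max_iters):
        changed = False
        for node in sorted(graph):
            if not graph[node]:
                continue  # isolated node keeps its own label
            counts = Counter(labels[nbr] for nbr in graph[node])
            best = max(counts.values())
            new = max(lbl for lbl, c in counts.items() if c == best)
            if new != labels[node]:
                labels[node] = new
                changed = True
        if not changed:
            break  # converged: no node wants to switch community
    return labels

# Two four-person cliques joined by a single 'bridge' tie (d-w):
graph = {
    "a": {"b", "c", "d"}, "b": {"a", "c", "d"},
    "c": {"a", "b", "d"}, "d": {"a", "b", "c", "w"},
    "w": {"d", "x", "y", "z"}, "x": {"w", "y", "z"},
    "y": {"w", "x", "z"}, "z": {"w", "x", "y"},
}
communities = label_propagation(graph)
# a,b,c,d converge to one label; w,x,y,z to another
```

Once such labels have been computed over successive time windows, the kind of question posed in the text – which external triggers cause individuals to migrate between communities – becomes a comparison of a node’s label across windows.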
Policing Cyberspace – The Surveillance State?
In the national security context, VA could be used, legitimately by authorities under warrant, to tap into private communications and databases as part of intelligence gathering for countering crime and terrorism. But access needs to be tightly controlled. In the UK this has been written into law through RIPA (the Regulation of Investigatory Powers Act 2000), which sets out the law regarding intercept and surveillance, and the limitations on the authorities on what can be gathered and the use to which it can be put.
The massive increase in cybercrime has demanded that police forces establish the capability to respond to what is already taking the form of an on-line arms race. Are the current measures adequate?
In the UK, the Metropolitan Police Service is host to the Police Central E-Crime Unit (PCeU), which investigates and prosecutes on-line crime but often runs headlong into foreign jurisdictions. This necessitates global agreements to tackle criminals who seamlessly straddle national boundaries. Digital forensics has also required a boost in capability and capacity: first finding and retrieving data hidden behind the multitude of readily available cloaking methods, then processing it in a manner suitable for presentation to a court of law, following accepted standards for evidence preservation.
But policing the internet has to take into account its very soul, the privacy and liberty essential to the free flow of ideas. Those concerned with civil liberties question the role of secret intelligence in an open society. As Brill in the Hollywood film Enemy of the State says: “How do we draw the line... between protection of national security, obviously the government’s need to obtain intelligence data, and the protection of civil liberties?”
This is a question worthy of some debate. With our increasing abilities to automatically mine and analyse huge quantities of data, it is not only governments but also big corporations and even other individuals who want to derive value from using our data (assuming, of course, that we can regard it as ours). This is most often for commercial advantage yet, with the threat of a surveillance society hanging over us, the challenge is to balance the protection of the state with the protection of civil liberties.
The quote from Brill continues: “... particularly the sanctity of my home? You’ve got no right to come into my home!”
Yet the European Electricity and Gas Directives [7] mandate the deployment of smart meters in every domestic setting by 2022. Each of those will publish half-hourly read-outs of energy consumption to the energy provider but will be capable of much greater data gathering.
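Even at the mandated half-hourly resolution, the volumes involved are striking. A back-of-envelope calculation (the meter count below is a rough illustrative figure for UK households, not taken from the Directives):

```python
readings_per_day = 24 * 2            # half-hourly read-outs
meters = 26_000_000                  # rough UK household count (illustrative)
readings_per_year = readings_per_day * 365 * meters
print(f"{readings_per_year:,}")      # ≈ 455 billion readings a year
```

And that is a single, coarse sensor per home; finer-grained appliance-level monitoring would multiply these figures again.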
One of the earliest attempts to gather and operate on big data was the US programme for Total Information Awareness [8] (TIA), which might be viewed as an attempt to realise the fictional capability depicted in Enemy of the State. The TIA program set out, in similar fashion to VA, to better detect terrorist operations and to inform US agencies’ responses. The vision was for an ‘architecture’ capable of integrating many other program outputs, which crucially attempted to predict events (incorporating social sciences) rather than simply respond post-event. Despite the acronym being redefined as Terrorist Information Awareness in response to much adverse publicity (particularly over intrusion and the compilation of dossiers on hundreds of millions of American citizens), the program was terminated in 2003. Nonetheless, TIA elements are largely to be found in current R&D programs funded through DHS, the Department of Defense and others [9]. Visual Analytics could be regarded as a direct descendant.
The modern face of TIA is Reality Mining [10], an increasingly referenced emerging technology defined as the collection of machine-sensed environmental data pertaining to human social behaviour. Reality Mining goes beyond the digital footprint idea (i.e. the collection of information left behind as one navigates cyberspace through on-line activity such as social networking, email, e-commerce, etc.) and into a much finer grained data space, a digital dust comprised of almost every instance of a person captured by electronic means (e.g. CCTV, online photo archives, etc.).
The key ingredient in Reality Mining, as it seeks to predict human behaviour, is to factor in the new discipline of Social Signal Processing [11], which aims to enable automatic recognition and interpretation of our non-verbal behaviour as a basis for the creation of socially sensitive machines. But, at its heart, Reality Mining, despite its benign promise in areas such as predicting the spread of disease based on networks of infection, relies on the collection of enormous amounts of deeply personal information, for example by employing pervasive technology such as always-on smart phone cameras and accelerometers. This raises quite naturally the question of protection of our data from those who may wish to use it for other means. Facebook users already need to be careful about broadcasting their holiday plans (for fear of home burglary), but how would you feel if a speeding ticket arrived in the post based on measured GPS coordinates from your own smart phone?
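The speeding-ticket scenario is technically trivial: two timestamped GPS fixes are enough. A hedged sketch using the standard haversine great-circle formula (the coordinates and timing below are invented for illustration, not real data):

```python
from math import radians, sin, cos, asin, sqrt

def speed_kmh(p1, p2, seconds):
    """Average speed between two (lat, lon) fixes, in km/h,
    using the haversine formula for great-circle distance."""
    lat1, lon1, lat2, lon2 = map(radians, (*p1, *p2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    km = 2 * 6371 * asin(sqrt(a))   # mean Earth radius ≈ 6371 km
    return km / (seconds / 3600)

# Two fixes 30 seconds apart along a road (illustrative coordinates):
v = speed_kmh((51.4990, -0.1790), (51.4990, -0.1640), 30)
# ≈ 125 km/h – well over an urban speed limit
```

Every smart phone logs fixes like these routinely; the only barrier to such a ticket is policy, not technology.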
The Right to Privacy
Much has been said, recently, about the growing awareness of individual privacy, with social media tools now seeking to offer more flexibility in privacy setting options for a more informed consumer. But how much of your personal data are you actually leaving behind as you navigate cyber space? A 2010 survey by internet security company AVG [12] found that 35% of newborns had an online presence (email, social media, photos, etc.), with 23% having a prenatal presence through their parents’ uploading foetal scans to the web.
So what can an individual do if they wish to be ‘forgotten’? The European Commission is getting into the act, tabling an amendment to Directive 95/46/EC on the protection of individuals with regard to the processing of personal data and on the free movement of such data [13]. Included in the proposed amendment would be the ability of a person to have their personal information deleted, including, for example, any photograph of them held on a social networking site. The sanctions for non-compliance could be astronomical, with fines of millions of euros being levied. This poses a severe challenge for data-handling organisations, whether in government or the private sector, especially as such data is increasingly hosted within cloud service environments.
In the era of cloud-based services, how is anyone to be truly certain that their request for deletion has been enacted?
As the Vice President of the European Commission, Viviane Reding, said on introducing the draft amendment to the Directive: “The protection of personal data is a fundamental right for all Europeans, but citizens do not always feel in full control of their personal data. My proposals will help build trust in online services because people will be better informed about their rights and in more control of their information ... A strong, clear and uniform legal framework at EU level will help to unleash the potential of the Digital Single Market and foster economic growth, innovation and job creation.”
Prof. Chris Hankin is Director of the Institute for Security Science and Technology (ISST) and a Professor of Computing Science at Imperial College London. Andrew Burton joined the ISST in 2008 as Programme Manager, having previously worked with the Ministry of Defence.
[1] Cisco Visual Networking Index (2011) Forecast and Methodology, 2010-2015 [online] Available at: <http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-481360_ns827_Networking_Solutions_White_Paper.html> [Accessed 18 March 2012].
[2] Thomas J. J. & Cook K. A. (2005) Illuminating the Path: The Research and Development Agenda for Visual Analytics. Washington: National Visualization and Analytics Center.
[3] The National Visualization and Analytics Center (http://nvac.pnl.gov/).
[4] Visual Analytics for Command, Control, and Interoperability Environments (http://www.purdue.edu/discoverypark/vaccine/).
[5] Command, Control, and Interoperability Center for Advanced Data Analysis (http://ccicada.org/).
[6] The UK Visual Analytics Consortium (http://www.eis.mdx.ac.uk/research/idc/UKVAC/Index.htm).
[7] The European Union Gas and Electricity Directives (http://gala.gre.ac.uk/3629/1/PSIRU_9600_-_2005-10-E-EUDirective.pdf).
[8] Poindexter, J. (2002) Overview of the Information Awareness Office [online] Available at: <http://www.fas.org/irp/agency/dod/poindexter.html> [Accessed 18 March 2012].
[9] Wired.com (2004) US Still Mining Terror Data [online] Available at: <http://www.wired.com/politics/law/news/2004/02/62390> [Accessed 18 March 2012].
[10] MIT (2009) Reality Mining, Machine Perception and Learning of Complex Social Systems [online] Available at: <http://reality.media.mit.edu/> [Accessed 18 March 2012].
[11] Imperial Dept of Computing (2009) Intelligent Behaviour Understanding Group, Social Signal Processing [online] Available at: <http://ibug.doc.ic.ac.uk/research/social-signal-processing> [Accessed 18 March 2012].
[12] Bizreport.com (2010) AVG: Children have online presence before birth [online] Available at: <http://www.bizreport.com/2010/10/avg-children-have-online-presence-before-birth.html> [Accessed 18 March 2012].
[13] European Commission (2012) Data Protection Day 2012: Safeguarding online privacy rights through modern data protection rules [online] Available at: <http://ec.europa.eu/commission_2010-2014/reding/multimedia/news/2012/01/20120124_en.htm> [Accessed 18 March 2012].