2017

Who knows me better than I know myself?

Who knows us better than ourselves? No seriously, who does? While the answer to ‘who’ may not be surprising, the extent of what they know may be astonishing.

We are aware that a vast amount of information is being collected by each of our devices. While some of this happens without our explicit knowledge, most of the data comes from text and images that we explicitly provide. Within each mobile application, there are various SDKs or JavaScript snippets that automatically collect and transmit data back to their servers. Desktop browsers and browser plugins have the means to track and manipulate the content on any page that is browsed. When banking online, the browser and its plugins have access to personal financial data, including account numbers and balances. They can track online shopping and spending behavior, medical problems, sexual behavior, and religious and spiritual beliefs.

Accessing data through a web browser is inherently less secure than using an application because the content delivery and its consumption are controlled by two different entities with different objectives. A browser may allow plugins to be installed which can track and manipulate content; the website providing the content has little or no control over this. It is now also possible to identify a user across devices through cross-device tracking techniques, which correlate a person's browsing history across different desktop and mobile devices. While I explicitly sign in to my Google account from multiple devices, other companies are able to link my devices through other means. For example, if I browse for an Amazon product on my phone, an advertisement for the same product is displayed on my other devices.

In addition to having information collected about us, we voluntarily provide significant amounts of information about ourselves, mostly by uploading photos, sending email and messages, and storing documents online.

Here is a photo taken in the early 1980s, which was later scanned and uploaded to Google Photos.

Google determined this picture was taken at NASA Johnson Space Center in Houston, Texas. This was determined without GPS information embedded in the photo. Google has created a visual representation of the entire world from all the uploaded photos, which has been folded into Google Maps and Google Street View. Every location, building and point of interest has been tagged, analyzed and memorialized. Using this, Google can accurately locate where a picture was taken even without GPS coordinates. Additionally, image recognition algorithms have become so good that Google is now able to identify people across different time periods. The three people in the foreground were accurately identified, thirty years after the picture was taken.

Here are some of the implications of these technologies, extrapolating from where we are today.

  1. Using image recognition, Google and Facebook are able to identify people from their photos. From uploaded photos, they can create a list of real-world connections between people even if others in the photos are not tagged.
  2. Photos provide a wealth of other information including
    1. Locations where we live and the places we visit.
    2. Personal tastes and preferences including
      1. What we like to wear,
      2. What we like to eat and where,
      3. What we drive,
      4. What we watch, play, listen to as in sports, games, music and films.
    3. A person's health, determined by
      1. Using visual cues to estimate a person's weight over time,
      2. Tracking distance moved (walked, run, cycled),
      3. Tracking sleeping and waking times,
      4. Tracking number of visits to the doctor,
      5. Tracking heartbeat using a fitness tracker if available.
    4. Track relationships between people by analyzing the sentiment from the photos they upload.
  3. By capturing mouse movements on the screen, they can determine whether a user is right- or left-handed.
  4. Through the browser, they can track personal financial status, including how much and where each person spends their money and what their bank balances are.
  5. Most personal correspondence has moved online through email and messaging applications. These products provide insights into our deepest and most intimate thoughts, emotions and sentiments.

These companies have a near 360-degree picture of who we are, what we wear and eat, where we travel, the state of our health, our spending patterns and our thoughts and feelings. They are constantly developing new algorithms and techniques to learn more about us from the current data sets. With advances in AI and ML, it is possible to correlate all this information to create deeper analysis. While this is currently used to serve more relevant advertisements, it could have other uses in the near future. For example, an AI assistant could make recommendations and predictions based on personal knowledge. This AI assistant could recommend a family-friendly car upon the arrival of a new child (if there is not one already) or suggest that a person who has been steadily gaining weight and missing work go see a doctor.

However, this information could have more sinister uses. Imagine what could happen if knowledge of a person's physical address, coupled with their current location, were to fall into the wrong hands. Google, Facebook and Uber may know this directly, while others like Amazon, UPS and FedEx could infer it from shopping or delivery patterns. Through Facebook or Instagram, the whole world may know when a person or family is away from their home for an extended period.

Is the cat out of the bag?

Individually, there are certain things we can do to minimize exposure, like using native applications (e.g. the mobile banking app rather than the browser), using browsers in incognito mode and disabling browser plugins.

Legislation has to be strengthened so that ownership of data rests squarely in the hands of consumers. Data collection should be separated from its usage, i.e. permissions have to be individually and explicitly granted for collecting data and for each use of it. For example, it should be possible for a consumer to allow location information to be collected and used for routing, but not for advertising. If this does not happen voluntarily, it has to be legislated and audited to verify compliance. Sharing of data between businesses should follow the same guidelines.
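
To make the separation of collection from usage concrete, here is a minimal sketch of a per-consumer permission record. The class `ConsentLedger` and its method names are hypothetical, invented for illustration; they do not describe any existing product or regulation. The only idea taken from the text is that collection and each purpose of use are granted separately, and that anything not granted is denied.

```python
from dataclasses import dataclass, field
from typing import Set, Tuple


@dataclass
class ConsentLedger:
    """Hypothetical record of what one consumer has explicitly allowed."""

    # Data types the consumer allows to be collected at all.
    collectable: Set[str] = field(default_factory=set)
    # (data type, purpose) pairs the consumer allows.
    usable: Set[Tuple[str, str]] = field(default_factory=set)

    def allow_collection(self, data_type: str) -> None:
        self.collectable.add(data_type)

    def allow_use(self, data_type: str, purpose: str) -> None:
        self.usable.add((data_type, purpose))

    def may_collect(self, data_type: str) -> bool:
        return data_type in self.collectable

    def may_use(self, data_type: str, purpose: str) -> bool:
        # Use requires both a collection grant and a grant for this purpose.
        return self.may_collect(data_type) and (data_type, purpose) in self.usable


ledger = ConsentLedger()
ledger.allow_collection("location")
ledger.allow_use("location", "routing")

routing_ok = ledger.may_use("location", "routing")          # granted
advertising_ok = ledger.may_use("location", "advertising")  # never granted
```

With this shape, an auditor only has to check that every use of a data type is preceded by a `may_use` call for the specific purpose.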

We do not want to rewind to a time before the mobile internet and give up the conveniences that it provides. But when privacy is traded for convenience, the decision about that tradeoff should rest squarely with the consumer.

Compute is ephemeral while data has gravity

The shift from compute-centric to data-centric computing is driven by a confluence of two trends: the first is an increase in the data collected, and the second is the use of this data to unpack additional value in the supply chain.

The explosion in data collection is driven by the increase in the number of computing devices. Historically, this number has increased by orders of magnitude with each generation, from mainframes to PCs to mobile devices. While there were only a handful of mainframe computers and a PC for every few people, mobile devices are ubiquitous; two-thirds of adult human beings on the planet possess one. The growth of IoT devices will follow the same exponential trend set by their ancestors, and there will be many IoT devices per person. But unlike their ancestors, IoT devices will be specialized and mostly autonomous.

Autonomous edge computing

Traditionally in computing, the value of data increased when it was shared. Excel spreadsheets became more useful when shared with co-workers, photos and videos when shared with family and friends. However, specialized devices collect data about their immediate environment, which may not be useful to another device in a different environment. For example, autonomous cars collect 10 GB of data for every mile; it is neither necessary nor possible to transfer all this data over the internet and back for real-time decision making. As data about a car's current environment changes in real time, the data from the past is no longer relevant and does not need to be stored. Additionally, this raw data is not useful to another car at a different location. By keeping processing close to where the data is produced, with higher bandwidth and lower latency than a round trip to the cloud, edge computing facilitates faster extraction of value from large amounts of data.
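
A rough back-of-the-envelope calculation shows why streaming this data is impractical. The sketch below takes the 10 GB per mile figure from the text; the speed of 60 miles per hour and the use of decimal gigabytes are assumptions made only for illustration, and the function name is hypothetical.

```python
def required_uplink_gbps(gb_per_mile: float, miles_per_hour: float) -> float:
    """Sustained uplink, in gigabits per second, needed to stream all sensor data."""
    gb_per_second = gb_per_mile * miles_per_hour / 3600.0  # gigabytes per second
    return gb_per_second * 8.0                             # bytes to bits


# Assumed highway speed of 60 mph; 10 GB per mile is the figure quoted above.
uplink = required_uplink_gbps(gb_per_mile=10.0, miles_per_hour=60.0)
print(f"{uplink:.2f} Gbit/s")  # about 1.33 Gbit/s, sustained, per car
```

Under these assumptions each car would need well over a gigabit per second of continuous upload, before counting the return trip for any decision made in the cloud.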

The inability to transfer large amounts of data over the internet will drive collaborative machine learning models like federated learning. Under this model, data collection and processing agents run at the edge and transfer a summary of their learnings to the cloud. The cloud is responsible for merging (averaging) the learnings and distributing them back to the edges. In the case of autonomous cars, each vehicle shares its learnings with the cloud, and the cloud merges them and redistributes the result to all the other vehicles.
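
The merge step described above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: it assumes each edge device reports its model as a flat list of numbers together with the count of local samples it trained on, and it assumes the cloud weights each report by that count. The function name `federated_average` is hypothetical.

```python
from typing import List, Sequence, Tuple


def federated_average(
    updates: Sequence[Tuple[Sequence[float], int]]
) -> List[float]:
    """Merge per-device model weights into one model.

    Each update is (weights, num_samples). Devices that trained on more
    samples contribute proportionally more to the merged model.
    """
    if not updates:
        raise ValueError("need at least one update")
    dim = len(updates[0][0])
    total_samples = sum(n for _, n in updates)
    if total_samples <= 0:
        raise ValueError("total sample count must be positive")

    merged = [0.0] * dim
    for weights, n in updates:
        if len(weights) != dim:
            raise ValueError("all updates must have the same length")
        for i, w in enumerate(weights):
            merged[i] += w * n / total_samples
    return merged


# Two cars report their locally trained weights; raw sensor data never leaves the car.
car_a = ([1.0, 2.0], 1)   # trained on 1 sample
car_b = ([3.0, 4.0], 3)   # trained on 3 samples
global_model = federated_average([car_a, car_b])
```

Only the small weight vectors cross the network; the cloud then sends `global_model` back to every device as the starting point for the next round.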

This trend has already started at Google, where engineers are working on federated collaborative machine learning without centralizing training data. Apple released the Basic Neural Network Subroutines (BNNS) library, enabling neural nets to run on the client. Currently BNNS does not train the neural net, but this is the next logical step. Specialized computers and systems will be built that are data centric, i.e. with the ability to move large amounts of data for processing at very high rates. One of the first examples is Google’s Tensor Processing Unit (or TPU), which outperforms standard processors by an order of magnitude. In the near future, every mobile device will have an SoC that is capable of running a fairly complex neural network. Data and the applications that consume it will be colocated, creating autonomous edge computing systems.

The gravity of data

As the cost of compute has been going down, the big three cloud vendors (AWS, Azure and Google) provide more services around data. Larger amounts of data will need more compute and higher bandwidth at lower latencies to extract value. It is easier to bring compute to the data than the other way around.

These vendors now provide AI and machine learning as a service to extract value from this data. In the near future, it will be possible to automatically analyze and transform the data to provide actionable insights. Think of the raw data as database tables and the transformed data as indexes that co-exist with the tables. As the vendors automate data transformation and analysis, they lock the data in and make it non-portable. Organizations should ensure that the process of value extraction is not dependent on a vendor’s proprietary technology, and that the transformed data stays portable.

So in summary

  1. We have shifted from being compute centric to being data centric.
  2. Large, temporal data will drive autonomous edge computing and federated machine learning. 
  3. Enterprises should not use proprietary technology for extracting value from their data.

References

  1. https://www.wired.com/2017/04/building-ai-chip-saved-google-building-dozen-new-data-centers/
  2. http://a16z.com/2016/12/16/the-end-of-cloud-computing/
  3. http://www.zdnet.com/article/data-gravity-the-reason-for-a-clouds-success/