Give up on privacy. This is what you should worry about instead

Privacy is an illusion in the world of big data and data mining

Privacy is an illusion. It simply doesn’t exist. You probably don’t believe me. Until recently, it was easy for us to have the illusion of privacy. Now, with accelerating increases in computational power and storage, and the incredible sophistication of data mining and machine learning, that fake veil of “privacy” is about to be torn down–and it will be shown to have been nothing more than a mirage in the first place.

The argument is simple, and comes in two parts. First, there is a syllogism:

Given: There is some set of data {x} that is public
Given: There is a set of data {y} that can be inferred or computed based on {x}

Then: The set of data points {y} must also therefore be considered public.

It’s a little abstract, but it’s tough to dispute. If your address is publicly listed as 1234 Main Street, Unit 1, and 1234 Main Street is publicly listed as an apartment building that only contains rental units, then logically one can conclude that you rent your apartment. The fact that you are a renter is therefore also public information. How could it not be?

The second “given” in the syllogism can be stated as a mathematical function,

{y} = F({x})

This is just another way of saying that there is some function F, some calculation that you can perform, that lets you figure out {y} from {x}. If that’s true, it’s difficult to justify any argument that {y} could be private if {x} is not. Sure, one might hope that the calculation of F is really, really hard, and that nobody would bother doing it. But that’s not a principled argument.

It’s like what computer security experts refer to as “security through obscurity”: a system isn’t actually secure if you simply “hide” the way to get in and hope nobody finds it. It’s like hiding porn from your spouse by burying it seven layers deep in folders called “New Folder” or “Work Stuff”. That doesn’t really make your porn files private or secure: it just means you’re hoping your spouse doesn’t take the time to crack your James Bond-like system.
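To tie this back to the renter example: below is a minimal sketch, in Python with made-up records, of what such a function F looks like in practice. The infer_renter_status helper is purely illustrative; the point is that it takes nothing but public inputs {x} and mechanically produces a fact {y} that was never published anywhere.

```python
# A toy illustration of {y} = F({x}): deriving an "unpublished" fact
# purely from public records. All of the data here is made up.

public_person_record = {"name": "Jane Doe", "address": "1234 Main Street, Unit 1"}

public_building_records = {
    "1234 Main Street": {"type": "apartment building", "units": "rental only"},
}

def infer_renter_status(person, buildings):
    """F({x}): combine two public data sets to compute a new fact {y}."""
    street = person["address"].split(",")[0]
    building = buildings.get(street)
    if building and building["units"] == "rental only":
        return "renter"
    return "unknown"

# {y} = F({x}): "Jane Doe rents her home" was never posted anywhere,
# yet it follows mechanically from records that were.
print(infer_renter_status(public_person_record, public_building_records))  # -> renter
```

Swap in bigger public data sets and a cleverer F, and the same mechanical step produces far more sensitive conclusions.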

And as our technology gets better and better, there will eventually be almost no functions F that are too difficult to solve. This leads us to the second part of my argument, which is this hypothesis:

As our technology and computational abilities improve, the set of “private” things that cannot be computed or deduced from public information will become vanishingly small.

I do not have proof of this hypothesis. But I cannot think of any scenario that concretely disproves it, either. Let’s take a look at some examples of what we think of as “private” information, so that you can see what I mean.

Where you live

“Doxxing” is usually defined as publishing someone’s private information, such as their home phone number or address, on the internet. But most acts of “doxxing” are nothing more than someone finding a person’s phone number or address in one place on the internet (Google, online whitepages, city or state public records, domain name registration companies) and re-posting it to another place on the internet (Facebook, Twitter, 4chan or Reddit). It hardly seems like a “breach of privacy” to move information from one public place to another.

[Image: Twitter’s policy on Google and doxxing]

Twitter has even acknowledged this in its approach to doxxing reports. Twitter has decided that if someone posts your address on Twitter, it is not a breach of privacy if they got your address from Google. This makes sense: if your information is already on Google, it can hardly be considered “private”. Moreover, this surely isn’t restricted to just Google. If any information that is already somewhere on the internet counts as “public”, then Twitter’s policy effectively nullifies the very concept of doxxing.

Moreover, think of how our technology is improving. Think about what is visible in the backgrounds of your selfies. Think about the houses that are visible in the videos you took of your cat out on the front lawn. Could a large-scale data-mining system match your photos against publicly available images to figure out where you live? Even if the answer currently is “no”, you can be sure that eventually the answer will be “yes”.

You don’t take Facebook photos or selfies? That’s fine. Most likely you drive on public roads to and from work every day. All of your movements over public highways and land are public data, which means that anything that can be inferred from that data is also public. That means where you live and where you work are both public information.

“Isn’t it stalkerish for someone to be researching where I drive every day?” Maybe in the past it was. In the past, your movements over public highways were only recorded if someone deliberately set out to record them. But that’s just not the case any more: companies like Avigilon, Cisco and Human Recognition Systems are building massive networks of video infrastructure that can record and analyze literally every person and every event visible in urban public spaces, every single moment of the day.

We may not be to the point that just anyone can “research” your traffic movements over public land, but that doesn’t mean they are private. It’s still public data. It just means that gathering all of the {x} needed to compute {y} is hard. But as our technology improves, as computational speed and power increase, the ability to casually search and analyze public traffic records becomes inevitable.

It bears repeating: if where you live ({y}) can in principle be calculated from public traffic records ({x}), then where you live is already public. Just because nobody is bothering to do it yet doesn’t mean it’s actually private.
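If that still sounds abstract, here is a deliberately crude sketch of the kind of calculation involved. The sightings and the guess_home helper are invented for illustration: given timestamped observations of a car on public roads, of the sort the camera networks above already collect, “home” is simply the place it keeps turning up late at night.

```python
# A rough sketch of how "where you live" ({y}) falls out of public
# movement data ({x}). The sightings below are invented; a real system
# would pull them from the kinds of camera networks described above.
from collections import Counter
from datetime import datetime

# (timestamp, location) pairs: where a given car was seen on public roads.
sightings = [
    ("2015-06-01 08:10", "Main St & 5th"),
    ("2015-06-01 08:45", "Office Park Dr"),
    ("2015-06-01 23:30", "Elm St & 2nd"),
    ("2015-06-02 07:55", "Main St & 5th"),
    ("2015-06-02 23:10", "Elm St & 2nd"),
    ("2015-06-03 00:15", "Elm St & 2nd"),
]

def guess_home(sightings):
    """F({x}): the place you are seen most often late at night is,
    with high probability, where you live."""
    late_night = Counter()
    for stamp, place in sightings:
        hour = datetime.strptime(stamp, "%Y-%m-%d %H:%M").hour
        if hour >= 22 or hour <= 5:              # crude "overnight" window
            late_night[place] += 1
    return late_night.most_common(1)[0][0] if late_night else None

print(guess_home(sightings))  # -> Elm St & 2nd
```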

Your personality and state of mind

Speaking of that network of cameras that record your every move over public property: they can also see details of your face, your posture, your movements. With the massive amounts of data and the incredibly intelligent data-mining techniques we have available today, these public data points will soon be used to compute your state of mind, your mood, and even features of your personality.

Does it sound like science fiction? It’s really not. There is already a company whose software is used by call centers to apply artificial intelligence to subtle features of your tone of voice when you call into a help line, profiling your mood and your interaction style so that you can be connected with a service representative who is a “good match” for you.

DirecTV is already using incredibly deep data-mining algorithms to learn things about you based on nothing more than your channel-browsing habits. The millions of micro-data points, ranging from how quickly you flip from channel to channel to which stations you pause on and for how long, can tell DirecTV if you are single or have a spouse, and even whether that spouse is in the room with you at the time.
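DirecTV’s actual models are proprietary, so the sketch below is only a toy: the profile_session helper, its thresholds, and the viewing data are all invented. It is here just to show how ordinary remote-control behavior becomes a feature set that supports exactly this kind of guess.

```python
# Toy sketch only: invented heuristics and invented data, not DirecTV's
# real algorithms. The point is that channel-surfing micro-behavior is
# already a usable feature set.

def profile_session(events):
    """events: list of (channel, seconds_on_channel) for one evening."""
    total_minutes = sum(sec for _, sec in events) / 60.0
    quick_flips = sum(1 for _, sec in events if sec < 5)   # near-instant channel changes
    flip_rate = quick_flips / total_minutes                # rapid flips per minute
    watched = [ch for ch, sec in events if sec > 600]      # channels actually settled on

    return {
        # Invented rule of thumb: constant unilateral flipping suggests
        # nobody else is negotiating over the remote.
        "someone_else_in_room": flip_rate < 0.5,
        "channels_actually_watched": sorted(set(watched)),
    }

# One made-up evening: (channel, seconds spent before changing).
evening = [("ESPN", 3), ("HBO", 2), ("CNN", 4), ("Comedy Central", 900), ("ESPN", 1200)]
print(profile_session(evening))
```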

Remember our equation: if a conclusion can be calculated from public data, then that conclusion is also public. The data obtained by DirecTV is not public, but the complex psychological things that they can learn about you illustrate exactly how much you can “give away”, without realizing it, in your casual behavior. Every little movement, every facial expression, every gesture that can be captured by cameras (or microphones) while you are walking in a public place is public data.

Which means that anything that can be computed from those–from anxiety level to personality disorders–is also public.

Your naked body

Another classic thing that people like to claim is “private” is their naked body. I have actually argued that this is stupid, but hey, it’s how people feel.

But here’s the thing: the sheer computational power and number-crunching ability that we will see with our technology over the next few decades will uncover massive new horizons of things that can be simulated and computed about the physical world. We have data on the exact flexibility and tensile properties of fabrics. We have knowledge of the physics of movement. We can simulate the way we expect objects to interact with each other and move against each other.

Do you really think computer simulations will not be able to figure out what your body looks like based on how your clothing hangs on you and moves around you when you walk?

OK. Sure. Keep thinking that.

In the meantime: what else you got? What’s something that is supposedly “private” that cannot in principle be calculated from public data, given enough computational and mathematical power? Let me know.

This is what we should be thinking about instead

If I’m right, and the number of things that cannot be computed from public data is vanishingly small–given sufficiently powerful computational ability–then privacy is a myth. We actually never had privacy: we only had the illusion of privacy because we didn’t have the technological capabilities to draw out all of the available data to their logical conclusions.

But that doesn’t mean you should freak out. It also doesn’t mean that we should just ignore the legal concerns that people associate with “privacy” in our society today. All it means is that we need to re-frame these problems in a different way.

Instead of focusing on privacy, we need to focus on data abuse.

The simplest example of this goes back to doxxing: You don’t need a concept of privacy to protect people from threats and harassment, or their homes from vandalism. Threats, harassment and vandalism are already crimes, and will remain crimes even when we give up the notion of “privacy” entirely.

Thinking about the big data work that DirecTV is doing suggests another cautionary tale: you cannot stop DirecTV from obtaining data about your channel browsing habits, but you can definitely make laws against DirecTV having their policies, rates or service depend on knowledge they gain from those browsing habits. To put it another way: you might not be able to prevent DirecTV from figuring out (based on your television-watching habits) that you are a sad, single man who lives alone with his dog and likes to drink until 2:00 am every night; but we can make laws against DirecTV abusing that knowledge by charging you more than other people for late-night skin flicks.

Are you worried about your employer? You don’t need to cling to the concept of “privacy” to protect yourself; just make it illegal for them to make decisions based on personal traits or activities.

Are you worried about the government? Instead of desperately trying to retain “privacy”, just make sure there are restrictions on how they can and cannot act on the information they have.

Information is free and will always be free. In a future of incredible technology and artificial intelligence, everyone will be able to compute the most amazing personal details of everyone else’s life–based on public data–if they want to. The goal should not be to stop that from happening, but to stop people from doing bad things with that knowledge.



7 views shared on this article. Join in...

  1. Benjamin Lim says:

    You mentioned that “And as our technology gets better and better, there will eventually be almost no functions F that are too difficult to solve. ”

    This is not true. Just a few days ago, Google decided to list all shared photos on public URLs. The URLs are 40 characters long, giving a total of 10^70 different combinations. A URL that is 41 characters long will have twice as many combinations, so it is indeed possible to simply generate a new function F that is just out of reach of the present amount of computing power.

    Source: http://www.theverge.com/2015/6/23/8830977/google-photos-security-public-url-privacy-protected

    “you cannot stop DirecTV from obtaining data about your channel browsing habits”

    This is not true either. Any device that you put into your home can be placed behind a firewall, which can be configured to block certain packets, or even to generate random packets if you want to fake your viewing habits. Sure, you need tech skills to configure it, but it can definitely be done.

    “Given: There is some set of data {x} that is public”

    Apart from regulating data use, we need to focus on reducing x as well. I leave you with a quote from a book I recently read.

    “Ubiquitous surveillance means that anyone could be convicted of lawbreaking, once the police set their minds to it. It is incredibly dangerous to live in a world where everything you do can be stored and brought forward as evidence against you at some later date. There is significant danger in allowing the police to dig into these large data sets and find “evidence” of wrongdoing, especially in a country like the US with so many vague and punitive laws, which give prosecutors discretion over whom to charge with what, and with overly broad material witness laws. This is especially true given the expansion of the legally loaded terms “terrorism,” to include conventional criminals, and “weapons of mass destruction,” to include almost anything, including a sawed-off shotgun. The US terminology is so broad that someone who donates $10 to Hamas’s humanitarian arm could be considered a terrorist.” – Bruce Schneier, Data and Goliath

    • Greg Stevens says:

      Thanks for your great comment! Let me try to reply point by point.

      1) There’s no reason to presume that Google’s storage and addressing system is the only one, OR that linear or even polynomial sequential search is the only way to approach pattern matching. In fact, any feature-coded database of images could use an annealing/settling algorithm for pattern matching that could function quickly regardless of N stored.

      2) It depends on where within the system the control point for channel selection happens. If a packet signaling a channel change has to be received by DirecTV for the channel to actually change, then clearly you can’t block it and still use your TV. 😉

      But that kind of technical detail is a little beside my main point. To the extent that we take for granted interaction with information in the public arena — whether that is consuming public technological resources or just driving down a street that can be viewed by satellite — we are inherently a source of public data points ourselves.

      • Benjamin Lim says:

        Thanks for your response.

        I would first like to correct a mistake I made in my previous comment. A URL that is 41 characters long will have approximately 62 times the number of possibilities, and not twice as previously mentioned. The number 62 comes from 26 lowercase letters, 26 uppercase letters, and 10 digits.

        The URLs used are randomly generated and do not depend on the image stored at that URL. A randomly generated sequence of characters from a cryptographically secure random number generator will have no patterns. It is akin to rolling a die: I can tell you the weight, material and size of the die, as well as the results of the past 1000 rolls, and the probability of getting a 6 on the next roll is still going to be 1/6 regardless of which pattern-matching algorithm you use, simply because there is no pattern. Let me know if you get a different result; it will be useful in Vegas 😉.

        In fact, die rolls were used during the WWII era to generate random sequences of characters for one-time pads, which were used by spies to encrypt information.

        • Greg Stevens says:

          I’m sorry, I don’t think my last comment was clear to you. Let me try again.

          Google doesn’t have sole control over public images, in the same way that Twitter doesn’t have sole control over tweets, right? Once a tweet is published into the public sphere, it is scraped by at least a dozen services that cache it, process it, repurpose it, and often republish it, as well as including it in their own massive indexed databases. Klout does this, Kred does this, private data-mining companies do it, and archival sites do it — this is why celebrities can’t get away with “deleting tweets.” 😉

          Anyway, the same is true of any images published to social media or streamed in public forums. They are open to any online service that wants to scrape and archive the data, not just Google. So your point about the way Google addresses its own data is interesting, but it is not relevant to the question of the true accessibility of public image data, simply because Google is not the only source. Once a picture is out there it can be, and IS, stored in a number of places and ways.

          Now, how about the volume of data? I mentioned that there are ways to match patterns that don’t scale with the N items stored because they aren’t based on a serial scan. If you’re not familiar with these methods, try researching “simulated annealing” and “content addressable memory”. These are pattern-matching techniques that can provide very quick image comparison and retrieval, and depending on how the image data is encoded they can be very sophisticated. So the mere “volume of data” is not a computational barrier IF the data is encoded the right way. And as I started out by saying, nothing is stopping parties other than Google from storing and encoding data in such a way.
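          If it helps, here is a toy sketch of the kind of “settling” content-addressable memory I mean. The patterns are random invented data, and the store/settle helpers are mine, not anything Google or anyone else actually runs; the point is only that the stored items live in one weight matrix and recall is a fixed-size settling computation on the cue, not a serial scan over the N items stored.

          ```python
          # Toy Hopfield-style content-addressable memory: recall "settles"
          # onto a stored pattern instead of scanning items one by one.
          import numpy as np

          def store(patterns):
              """Hebbian storage: all +/-1 patterns get baked into one weight matrix."""
              d = patterns.shape[1]
              W = np.zeros((d, d))
              for p in patterns:
                  W += np.outer(p, p)
              np.fill_diagonal(W, 0)
              return W / len(patterns)

          def settle(W, cue, steps=10):
              """Start from a noisy or partial cue and let the state settle."""
              s = cue.copy()
              for _ in range(steps):
                  s = np.sign(W @ s)
                  s[s == 0] = 1
              return s

          rng = np.random.default_rng(0)
          patterns = rng.choice([-1, 1], size=(5, 200))      # five stored "feature codes"
          W = store(patterns)

          noisy = patterns[2].copy()
          flipped = rng.choice(200, size=30, replace=False)  # corrupt 15% of the cue
          noisy[flipped] *= -1

          recovered = settle(W, noisy)
          print(np.array_equal(recovered, patterns[2]))      # almost certainly True
          ```

          Capacity is limited and real systems use far fancier encodings, but the retrieval cost here never involves walking the whole database.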

          This is what I tried to express in my last comment; I hope this was clearer.

  2. mirella says:

    I have reason to believe that we live in an Intelligent Universe. As I also know that everything is systems inside systems that are self-organizing, I came to ponder the boastful statements of Kurzweil, Vinge, and the like: once we come to a “singularity” moment (see the Singularity Summits; calculated also based on Moore’s Law), we are likely to step into unknown territory that is “post-human”, or transhuman – see also Hutter’s emergent “vorld” (virtual + world).

    The most probable consequence of this “quality jump”, according to them, would be a virtual world of virtual humans – “robots”, in common jargon. Also in their view, these should be the ones to fix all our troubles: climate, terrorism, delinquency, warfare.

    I started by saying that I am caught up in this dilemma: while everything these guys have predicted seems to fall into place year after year – even the “filing” of the population, facilitated by increased computing power – where does the human spirit come in?? I said that there is Intelligence in the universe, so did Evolution bring us here only to be replaced, or “enhanced”, by nanocomputers running in our bloodstream?

    Relative to your article here, I, too, believe that the only way is not to avoid being “unveiled”, but to become the true image of what we would like to be seen as: superior, moral, “clean and clear”.
    In my personal view, the Intelligence out there is not overseeing us to promote and evolve superior AI, but to make us become superior; we are self-organizing systems at the edge of chaos, after all. Ancient religions of the world were right: we are reaching the “end of times”, followed by a singularity, where the whole human race will shift to a higher stage in evolution, all natural, though. AI is just the tool of our days to make us more intelligent, more informed!

    !! You made me read your long article; now this long answer is my revenge 🙂

    • Greg Stevens says:

      Haha thanks Mirella, I appreciate the long comment! I always hope my long articles are able to get people thinking. 🙂

      I’m more skeptical than many that the singularity will solve all of humanity’s vices and psychological problems. I have a much easier time imagining a universe where super-intelligent people with plastic bodies simply have DIFFERENT prejudices, lusts, confusions and problems… rather than none at all. But who knows! That may simply be a lack of my own imagination. 😉

      I think you’ll enjoy the video I did with Gray Scott about the Singularity, by the way, just published today:

      I’d love to get your feedback on it!

      • mirella says:

        OMG! There are so many coincidences happening here! I just watched your podcast and got so excited about the discussion, because just today I was working on my Pinterest boards: Fractal Art, Nature and Digital, Mandalas, and the newest one I only opened today, Cymatics; there I discovered the Saturn Hexagon, the one that your guest was talking about. I posted something on my blog about the envisioned Singularity as soon as I found out about Zuckerberg’s announcement, made earlier this month, that he was planning to invest a lot in AI to improve Facebook’s performance and services. It was his announcement that made me think in the first place of our total loss of privacy. Then, I was really excited to read your article on the subject.
        I liked the arguments of your guest a lot, but it seems to me he is in love with technology for the love of technology, while there’s a simpler explanation for fractals, the Hexagon, codes and patterns: we are born with them, they are part of our kit for life as we come into this world. Just to talk “technology”, these are part of our software that gets activated as we progress through life, specific to each of our phases of evolution. There was (Occam obliging) a different name for that: Archetypes.

        True, this has been going on since the beginning of the world, but these are patterns of transformation and evolution (self-organization), not patterns of Elitism.

        Hutter develops very pertinently the possible advent of a singularity, rightfully asking what would be there in a “vorld” of artificial intelligence, given that this would imply “them” above and in control (being more intelligent) and us below (as we are destructive and harmful, don’t know what we want and do, and so endangered life itself); “they” come with solutions, but aren’t going to allow us the lead, lest we repeat the stupid things.

        Our world now is functioning based on our organic needs, and we receive feedback and thrive through the satisfaction of our senses, nutrition and reproduction.

        Once we do not matter anymore, what will computers need to exist for? Just to be more intelligent than us? What shall drive them, motivate them, once they overcome us and our human intelligence: would they fall in love, perform courtship rituals as a condition of reproduction, then create a whole culture about it, come up with ideologies for ritual and worship, etc., etc., all stuff that we have been doing? You can answer that this is a very simplistic and reductive way to try and peep over that wall. But then, let’s say that they keep on discovering and exploring: what motivates them?? What feedback loop would inform them of intended and unintended consequences?? According to which criteria? What would be good and what would be bad for them, knowing that, in human experience, we do not have a uniform agreement on what is good and satisfactory for us all?

        What’s going to be the engine: would just being programmed for their superior function suffice? And by whom? Other computers? And who programs those? An elite of humans? But wouldn’t they be supposed to be Less intelligent?? Because, if there were such exceptional people, these would be the Chosen elite, those who used to “love” to keep up with technology.
        I’m afraid I’ll have to stop here, as it is 4:30 am in my part of the world (not the best time, and not my best shape, for a discussion).

        In a later part of my argumentation I might try to show that the only reason this technological singularity idea is wrong is that intelligence, consciousness, is not “fabricated” in the brain, but is a fundamental field that we only need to connect ourselves to – each one of us disposing of a more or less capable “machine”, a fine-tuned or a less-well-tuned brain.