Any survey of what people are talking about in AI right now will tell you that privacy is important. And that’s an understatement.
Time and time again, we hear people talking about how they can implement a certain kind of AI system, but they’re worried about the privacy issues involved.
Sometimes looking at the overall landscape in a granular way can show us how to do better.
For example, new systems are increasingly capable of analyzing unstructured data and aggregating all of the relevant data points into a unified whole.
What does this mean for privacy? Is the unstructured data inherently less personal, and less sensitive?
Think about that large tranche of unstructured data as a catch-all for all sorts of information that may be about anything at all. In that way, it seems like getting data from this wide field would be less intrusive than, say, harvesting a database carefully filled with people’s identifiers and potentially sensitive financial information like account numbers, etc.
But you can still have some pretty private things floating around in unstructured data sets. For example, a letter relevant to somebody’s HIPAA information might look like it doesn’t have any of that included, at least to the human eye. But then when you apply a more powerful engine and get your insights back, you might see that the machine was capable of pulling that sensitive data from some unexpected places.
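To make that concrete, here is a minimal sketch of how even a crude scan can surface sensitive details buried in free text. The patterns, labels, and sample letter are all made up for illustration, and a few regular expressions are nowhere near what a modern AI system can infer; the point is only that ordinary-looking prose can carry identifiers.

```python
import re

# Hypothetical patterns for illustration only; real detection needs far more than regexes.
PATTERNS = {
    "ssn_like": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "date_of_birth": re.compile(r"\b(?:DOB|born)[:\s]+\d{1,2}/\d{1,2}/\d{4}\b", re.IGNORECASE),
    "medical_record_number": re.compile(r"\bMRN[:#\s]*\d{6,10}\b", re.IGNORECASE),
}


def find_sensitive_spans(text):
    """Return a sorted list of (label, matched_text) pairs found in free text."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group(0)))
    return sorted(hits)


# An invented letter that reads like ordinary correspondence.
letter = (
    "Dear Dr. Lee, thank you for seeing my father last week. "
    "His chart is MRN 00482913 and he was born 4/12/1951. "
    "Please send the results to our home address."
)

hits = find_sensitive_spans(letter)
for label, value in hits:
    print(label, "->", value)
```

A human skimming that letter sees a thank-you note. The scan sees a record number and a birth date, and a more capable model could go on to link them to everything else it has read.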
The social media example is one way to look at this issue. Social media is unstructured data – we know that – and we’re used to seeing all of these stories about ourselves floating around on Facebook or Twitter. You might think, well, it doesn’t have my Social Security number or my bank account number attached. But spearphishers don’t always need these identifiers, if they have the right story. In addition, AI might be able to pull together all sorts of inferences, timelines, and implications, and build personal narratives about you or me that are downright scary. Think about how insurance fraud detectives use social media, for instance. That practice might catch people committing fraud – but how intrusive is it?
So how are new AI systems doing this? If you look at the hardware, we’re just coming into the era where AI can move from more structured data sets to a general technology taking in so much data that you might think of it as approaching universal knowledge (for those who have ever heard of Laplace’s demon, that’s either thrilling or terrifying).
At a recent IIA event, Michelle Fang talked about a chip with 900,000 cores and 4 billion transistors that makes scaling easier and removes the need for complex parallel programming.
But she pointed out that the systems are also more capable with unstructured data. The additional power comes in handy for scouring a wider knowledge base and collecting what you want.
Quote:
“Today we’re able to run enormous models on a single system. And so we can scale through data parallelism, which is quick and easy. And so there is no need for complex parallel programming libraries like Megatron … as a business, through developers, when they use our systems, they can focus on the AI and not the complex parallel programming. And as a result, they can start and scale their work much more quickly.” – Michelle Fang, citing use cases with Mayo Clinic and others
One way to look at this is through the lens of data governance. You can identify where data lives (for example, in AWS object storage) and the metadata that comes along with it. You can start to analyze whether the AI will be able to build sensitive information models from the bits and pieces that it gleans from unstructured data. And you can ask, first and foremost, as so many now do: who owns that data?
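As a rough sketch of what that governance pass might look like, the snippet below walks a small inventory of stored objects and flags the ones that deserve a privacy review. Everything in it is an assumption made for illustration: the `StoredObject` record, the `owner` and `classification` tag names, and the list of content types treated as unstructured. It works on an in-memory list, not on a real storage API.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """Hypothetical inventory record: an object key plus the metadata kept with it."""
    key: str
    content_type: str
    tags: dict = field(default_factory=dict)


# Assumed set of content types treated as unstructured for this sketch.
UNSTRUCTURED_TYPES = {"text/plain", "application/pdf", "message/rfc822"}


def review_queue(objects):
    """Return (key, reasons) pairs for objects that need a privacy review."""
    queue = []
    for obj in objects:
        reasons = []
        # First question: who owns this data?
        if "owner" not in obj.tags:
            reasons.append("no owner recorded")
        # Unstructured content nobody has classified is where surprises hide.
        if obj.content_type in UNSTRUCTURED_TYPES and obj.tags.get("classification") is None:
            reasons.append("unstructured and unclassified")
        if reasons:
            queue.append((obj.key, reasons))
    return queue


# Invented sample inventory.
inventory = [
    StoredObject("letters/2023/intake-0412.pdf", "application/pdf"),
    StoredObject("exports/accounts.csv", "text/csv",
                 {"owner": "finance", "classification": "restricted"}),
    StoredObject("notes/call-summary.txt", "text/plain", {"owner": "support"}),
]

queue = review_queue(inventory)
for key, reasons in queue:
    print(key, "->", "; ".join(reasons))
```

Notice that the carefully labeled accounts export passes cleanly, while the loose letter and call notes are the items that end up in the queue. That is the pattern to expect: the structured data is usually the part that has already been governed.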
As we go, we’ll start to see where the privacy threats lie. Or if we don’t do this, we might see these issues pop up at an anecdotal user level, with people being rightly upset.
I will say that the hardware itself is quite impressive. These processors go far beyond the simple idea of multicore technology. It looks like they’ll make tomorrow’s data center into something you could hold in your hand.
In any case, when we think about unstructured data, we should think about what it looks like when it’s reduced, boiled down and refined, for whatever a machine can figure out about you!