[liberationtech] OpenUp Corporate Data while Protecting Privacy - Open Up?

Fri Oct 31 16:45:26 PDT 2014

OPENUP CORPORATE DATA WHILE PROTECTING PRIVACY

October 31st, 2014

*Stefaan G. Verhulst* <http://thegovlab.org/about/team/stefaan-verhulst/>
 and *David Sangokoya* <http://thegovlab.org/about/team/>, The GovLab, New
York University

Consider a few numbers: By the end of 2014, the number of mobile phone
subscriptions
<http://www.itu.int/net/pressoffice/press_releases/2014/23.aspx> worldwide
is expected to reach 7 billion, nearly equal to the world’s population. More
than 1.82 billion people
<http://www.statista.com/statistics/278414/number-of-worldwide-social-network-users/> communicate
on some form of social network, and almost 14 billion sensor-laden everyday
objects <http://www.emc.com/about/news/press/2014/20140409-01.htm> (trucks,
health monitors, GPS devices, refrigerators, etc.) are now connected and
communicating over the Internet, creating a steady stream of real-time,
machine-generated data.

Much of the data generated by these devices is today controlled by
corporations. These companies are in effect “owners” of terabytes of data
and metadata. Companies use this data to aggregate, analyze, and track
individual preferences, provide more targeted consumer experiences, and add
value to the corporate bottom line.

At the same time, even as we witness a rapid “datafication” of the global
economy, access to data is emerging as an increasingly critical issue,
essential to addressing many of our most important social, economic, and
political challenges. While the rise of the Open Data movement has opened
up over a million datasets around the world, much of this openness is
limited to government (and, to a lesser extent, scientific) data. Access to
corporate data remains extremely limited. This is a lost opportunity. If
corporate data—in the form of Web clicks, tweets, online purchases, sensor
data, call data records, etc.—were made available in a de-identified and
aggregated manner, researchers, public interest organizations, and third
parties would gain greater insight into patterns and trends that could help
inform better policies and lead to greater public good (including combatting
Ebola
<http://www.economist.com/news/leaders/21627623-mobile-phone-records-are-invaluable-tool-combat-ebola-they-should-be-made-available>).

Corporate data sharing holds tremendous promise. But its potential—and
limitations—are also poorly understood. In what follows, we share early
findings of our efforts to map this emerging open data frontier, along with
a set of reflections on how to safeguard privacy and other citizen and
consumer rights while sharing. Understanding the practice of shared
corporate data—and assessing the associated risks—is an essential step in
increasing access to socially valuable data held by businesses today. This
is a challenge certainly worth exploring during the forthcoming OpenUp
conference <http://www.openup2014.org/>!

*Understanding and classifying current corporate data sharing practices*

Corporate data sharing remains very much a fledgling field. There has been
little rigorous analysis of the different ways of sharing or of their impact.
Nonetheless, our initial mapping of the landscape suggests there have been
six main categories of activity—i.e., ways of sharing—to date:

*1. Research partnerships,* in which corporations share data with
universities and other research organizations. Through partnerships with
corporate data providers, several research organizations are conducting
experiments using de-identified and aggregated samples of consumer
datasets and other sources of data to analyze social trends. For instance,
Safaricom, one of Kenya’s leading mobile companies, shared a year of
de-identified phone data with Harvard researchers to analyze and map how
migration patterns contributed to the spread of malaria in Kenya
<http://www.hsph.harvard.edu/news/press-releases/cell-phone-data-malaria/>.

*2. Prizes and challenges,* in which companies make data available to
qualified applicants (including civic hackers, pro bono data scientists, and
other expert users) who compete to develop new apps or discover innovative
uses for the data. Last year, Spain’s regional bank BBVA hosted a contest
<http://www.centrodeinnovacionbbva.com/innovachallenge/inicio> inviting
developers to create applications, services, and content based on anonymous
card transaction data. The first prize went to an application called Qkly
<http://www.centrodeinnovacionbbva.com/en/innovachallenge/michele-trevisiol-oscar-marin-and-alejandro-hernandez>,
which helps users manage time by estimating what time of day a given site
or destination will be most crowded (thus helping users, for example,
avoid lines).

*3. Trusted intermediaries,* where companies share data with a limited
number of known partners for analysis, modeling, and other value chain
activities. For example, companies from the consumer packaged goods,
retail, and over-the-counter health care industries often share data with
firms such as Information Resources, Inc. (IRI), a data analytics and
strategy firm that provides business intelligence and predictive analytics
solutions <http://www.iriworldwide.com/>.

*4. Application programming interfaces (APIs)*, which enable access to
streams of corporate data for developers and others to conduct testing,
product development, and data analytics. Major health insurance companies,
such as Kaiser and Aetna, use APIs to create more integrated ecosystems
across mobile applications and devices
<https://www.carepass.com/carepass/getstarted;jsessionid=F07A85D775686A49A446FE2116DDD311> for
consumers. Aetna’s CarePass API gives consumers access to their personal
data to sync with wearable health platforms such as Fitbit or the Apple
Watch.

*5. Intelligence products,* where companies share (often aggregated) data
that provides general insight into market conditions, customer demographic
information, or other broad trends. Google shares search query-based data
<http://www.google.org/flutrends/> in conjunction with data from the US
Centers for Disease Control in order to estimate levels of influenza
activity across the country over time.

*6. Corporate data cooperatives or pooling,* in which corporations (and
other important data holders, such as government agencies) group together to
create “collaborative databases” with shared data resources. For example,
through its Accelerating Medicines Partnership
<http://www.nature.com/news/pharma-firms-join-nih-on-drug-development-1.14672>,
the US National Institutes of Health (NIH) is helping organize data pooling
among the world’s largest biopharmaceutical companies in order to identify
promising drug and diagnostic targets for Alzheimer’s disease, systemic
lupus erythematosus, rheumatoid arthritis, and diabetes.

*Assessing risks of corporate data sharing*

Although shared corporate data offers several benefits for researchers,
public interest organizations, and other companies, risks do exist,
especially regarding personally identifiable information (PII). When
aggregated, PII can help reveal trends and broad demographic
patterns. But if PII is inadequately scrubbed and aggregated data can be
linked back to specific individuals, this can lead to identity theft, discrimination,
profiling, and other violations of individual freedom. It can also lead to
significant legal ramifications for corporate data providers.

Based on our initial research, we have found that most companies are aware
of these risks and have taken steps to de-identify aggregated datasets.
Such steps include partnering with academic experts and experimenting
with new de-identification methods. It is important to point out, however,
that there exist no industry standards or widely accepted best practices
for de-identification of corporate data. Complete anonymization would of
course provide the safest way to scrub datasets of PII, but it might also
reduce the “granularity” and thus usefulness of the data.
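
To make the granularity trade-off concrete, here is a minimal Python sketch.
It is an illustration only and does not describe any company's actual
practice. It assumes a hypothetical list of call-style records with
made-up field names (subscriber_id, region, day), replaces raw identifiers
with a keyed hash, and publishes only per-region counts that reach a
minimum size. The secret key, the field names, and the threshold values are
all assumptions chosen for the example. Note that keyed hashing is
pseudonymization rather than full anonymization.

```python
import hashlib
import hmac
from collections import Counter

# Hypothetical secret held only by the data provider (an assumption for this sketch).
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonymize(subscriber_id: str) -> str:
    """Replace a raw identifier with a keyed hash (pseudonymization, not anonymization)."""
    digest = hmac.new(SECRET_KEY, subscriber_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def aggregate_with_suppression(records, min_count):
    """Count distinct subscribers per (region, day); drop cells below min_count."""
    distinct = {
        (r["region"], r["day"], pseudonymize(r["subscriber_id"])) for r in records
    }
    counts = Counter((region, day) for region, day, _ in distinct)
    return {cell: n for cell, n in counts.items() if n >= min_count}


# Made-up sample records; not real data.
records = [
    {"subscriber_id": "A", "region": "North", "day": "2014-10-01"},
    {"subscriber_id": "B", "region": "North", "day": "2014-10-01"},
    {"subscriber_id": "C", "region": "North", "day": "2014-10-01"},
    {"subscriber_id": "A", "region": "North", "day": "2014-10-01"},  # repeat activity
    {"subscriber_id": "D", "region": "South", "day": "2014-10-01"},
]

coarse = aggregate_with_suppression(records, min_count=3)  # safer, less detail
fine = aggregate_with_suppression(records, min_count=1)    # more detail, more risk

print(coarse)
print(fine)
```

With the higher threshold, the single-subscriber "South" cell is withheld,
which protects that individual but also removes information a researcher
might have wanted. Lowering the threshold restores the detail and the risk.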

Participants at a recent Responsible Data Forum
<http://www.unglobalpulse.org/rdf-private-sector-data-sharing> held at the
Rockefeller Foundation, in New York City, suggested creating a “starter
kit” (or “how-to guide”) for private sector companies aiming to open access
to data while protecting privacy. In addition to this starter kit,
companies, researchers, and governments could also start developing a
safety ranking system based on a “taxonomy of harms.” More generally, further
thought and discussion are required to determine de-identification methods
and standards (including on ways to prevent re-identification).
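
As one illustration of what a re-identification check might look like, the
Python sketch below measures how many records share each combination of
indirect identifiers (so-called quasi-identifiers). This is the idea behind
k-anonymity, a widely discussed measure that we use here only as an example.
It is not a standard proposed at the forum, and the column names (zip3,
birth_year, gender) and sample rows are hypothetical.

```python
from collections import Counter


def group_sizes(rows, quasi_identifiers):
    """Count how many rows share each combination of quasi-identifier values."""
    return Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)


def smallest_group_size(rows, quasi_identifiers):
    """Return the k for which the table is k-anonymous (0 for an empty table)."""
    sizes = group_sizes(rows, quasi_identifiers)
    return min(sizes.values()) if sizes else 0


def risky_rows(rows, quasi_identifiers, k):
    """Return rows whose quasi-identifier combination is shared by fewer than k rows."""
    sizes = group_sizes(rows, quasi_identifiers)
    return [
        row for row in rows
        if sizes[tuple(row[q] for q in quasi_identifiers)] < k
    ]


# Made-up sample rows; not real data.
rows = [
    {"zip3": "100", "birth_year": 1980, "gender": "F"},
    {"zip3": "100", "birth_year": 1980, "gender": "F"},
    {"zip3": "100", "birth_year": 1975, "gender": "M"},
]

detailed = ["zip3", "birth_year", "gender"]
generalized = ["zip3"]  # coarser release: birth year and gender withheld

print(smallest_group_size(rows, detailed))
print(risky_rows(rows, detailed, k=2))
print(smallest_group_size(rows, generalized))
```

In the detailed release, one row is unique and could be matched against an
outside source that lists the same attributes. Withholding two columns
removes that uniqueness, at the cost of a less granular dataset.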

*Mapping the next frontier*

Beyond the broad taxonomies presented above, there exists almost no
systematic analysis of the practice, risks, and impact of corporate data
sharing. A more comprehensive mapping of the field of corporate data
sharing is urgently needed. Such a mapping would draw on a wide range of
case studies and examples to identify opportunities and gaps, evaluate
risks, provide evidence of impact, determine best practices in
de-identification techniques and privacy frameworks, and ultimately inspire
more corporations to allow access to their data. “Opening Up” corporate
data is the next frontier of open data. The potential societal benefits
that could flow from accessing corporate data are tremendous—but they will
only be realized when the public (consumers, citizens, and companies
themselves) has solid evidence of those benefits as well as trust in the
way data is shared and accessed.

*Stefaan G. Verhulst is the co-founder and chief of research and
development at The GovLab, New York University. David Sangokoya is a
research fellow at The GovLab, New York University.*

http://www.openup2014.org/openup-corporate-data-protecting-privacy/