OpenMined was delighted to share feedback with the European Commission on the draft Delegated Regulation on data access provided for in the DSA, which aims to lay down the specific technical conditions and procedures necessary to enable “vetted researchers” to be granted access to data held by very large online platforms (VLOPs) and very large online search engines (VLOSEs).
Our comments can be found in full here. A portion of them is reproduced below.
Modern PETs allow for a fundamental change in how external research is framed. Current conversations about external access narrowly focus on researchers obtaining “access to data”, assuming that researchers need direct read access to, or a copy of, raw datasets in order to perform their analysis. But data is merely a means to an end. What researchers really want is access to answers. They aren’t after a billion tweets; they want to know whether tweet ranking is biased by race or gender. They aren’t after ten million video uploads; they want to know whether TikTok’s video feed is driving mental health issues in teenagers. They don’t want a database; they want the final histogram or table of metrics in their research paper, the one that surfaces an important insight about society. Everything else is a means to that end.
Several PETs focus on facilitating the creation of (verified) answers without seeing the underlying data. The PETs industry hasn’t coalesced on a term for this yet — data spaces, federated learning, trusted research environments, trusted execution environments, data clean rooms, secure enclaves, secure multi-party computation, remote data science, and other terms all encompass this ideal — but we call this structured transparency.
In short, structured transparency is a new approach: researchers access answers instead of data. We have found that this approach has enormous implications for running external research programs. First and foremost, it does not infringe upon the rights and interests of the data provider, including the protection of confidential information, in particular trade secrets and the security of their services. When an external researcher sees raw data from a VLOP, there is almost no way to guarantee that they won’t upload it to the dark web, sell it on the side, or use it for purposes other than those they promised. The structured transparency framework calls this “the copy problem”: once a data provider shares a copy of a dataset with a researcher, the provider can no longer control how that dataset is used. Having given the copy away, the data provider loses technical control over the information and must trust that the researcher, or any other recipient of the copy, will not misuse it.
Social institutions attempt to prevent people from misusing information: the United States Government passed HIPAA to protect medical information and enforces that law through various regulations; the European Union has the GDPR; California has the CCPA. But these are difficult to enforce, because once information is copied there is no guarantee that a data provider, an oversight authority, or anyone else can find out where the information went or what it was used for, let alone do anything about it. A researcher can sign every legal agreement under the sun, but in most cases, if they obtain raw data, preventing misuse is broadly unenforceable.
Copying information is nearly impossible to prevent without an incredible reduction in individual liberty. The consequences are often associated with, and initially felt by, the data providers, who typically make the initial trade-off when deciding whether to share information, weighing the benefits of sharing against the risks of misuse. The long-term consequences, however, fall on the people the data describes: the information often concerns personal details of their lives, and the misuse of a copy of their data translates into harm they experience in the digital world, the physical world, or both.
If a researcher obtains access to raw data, data providers’ concerns can be both legitimate and significant. However, if the only thing a researcher ever acquires is a verified answer to a specific question they propose, this concern is mitigated. It is further mitigated by reducing the number of copies of the data in existence: a smaller attack surface, less potential for misuse, and greater personal privacy.
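The difference between data access and answer access can be made concrete in a few lines of code. The sketch below is purely illustrative — the class name, record fields, and bias metric are our own assumptions, not any platform’s real API. It shows a gatekeeper that holds the raw rows privately and releases only an aggregate metric, suppressing groups too small to report safely:

```python
# Minimal sketch of "answers, not data": the gateway holds the raw dataset
# privately and releases only aggregate answers to an approved query.
# All names (AnswerGateway, rank_score, the group key) are hypothetical.

from statistics import mean

class AnswerGateway:
    def __init__(self, records):
        self._records = records          # raw rows never leave this object
        self._min_group_size = 5         # refuse answers over tiny groups

    def mean_rank_by_group(self, group_key):
        """Return the mean ranking score per group -- an answer, not the rows."""
        groups = {}
        for row in self._records:
            groups.setdefault(row[group_key], []).append(row["rank_score"])
        result = {}
        for group, scores in groups.items():
            if len(scores) < self._min_group_size:
                continue                 # suppress small groups entirely
            result[group] = round(mean(scores), 3)
        return result

# The researcher never sees individual rows -- only the final metric table
# that would appear in their paper.
records = (
    [{"gender": "a", "rank_score": 0.60} for _ in range(50)]
    + [{"gender": "b", "rank_score": 0.48} for _ in range(50)]
)
gateway = AnswerGateway(records)
print(gateway.mean_rank_by_group("gender"))   # {'a': 0.6, 'b': 0.48}
```

A production system would add vetting of the query itself, audit logging, and formal protections such as differential privacy; the point here is only that the interface returns answers rather than data.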
With respect to the transmission, exchange, storage, and other processing of personal data in the DSA data access portal, the specific technical conditions for data access under the current paradigm contemplated in the draft delegated act would require the portal to serve as a gargantuan data repository built by centralizing copies of data. This means the DSA data access portal would need a data storage system comparable in size to every VLOP and VLOSE combined. In addition, the Commission would have to pay dozens of engineers to replicate the software infrastructure currently deployed at each VLOP/VLOSE. This is a startlingly high cost. Each VLOP/VLOSE has its own data infrastructure, which has typically taken dozens of engineers multiple years to create and test. As each VLOP/VLOSE develops novel paradigms and interfaces for next-generation data architecture, the DSA data access portal would need to duplicate that work. The portal would therefore require not only sufficient technical resources to replicate what goes on at all the VLOPs/VLOSEs, but also the ability to keep pace with the infrastructure at each VLOP/VLOSE on an ongoing basis, such that it is always ready to receive new data. This would be a Herculean effort.
Alternatively, the specific technical conditions for data access under the new paradigm introduced above would require the portal to serve only as a thin orchestration layer that can flexibly integrate with the existing distributed data environment. Converting the DSA data access portal into a thin orchestration layer ensures that the principles of “necessary and proportionate” access are fulfilled both by the digital infrastructure’s architecture and system design and by the formal technical guarantees such a system offers at the individual project level. Put succinctly, the draft delegated act, as written, operates on an outdated data access paradigm in which copies of data must be securely stored by another party to ensure sufficient access. In a data access paradigm that contemplates PETs, copies of data need never be created: data remains secure at its original point of collection while enjoying the same, if not greater, access efficiency and researcher scale.
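The thin-orchestration-layer idea can be sketched in a few lines as well. In the hypothetical example below (platform names, query shape, and endpoint interface are all our own assumptions), the query travels to the data and only aggregate partial answers travel back; the portal stores no copies of anything:

```python
# Illustrative sketch of the portal as a thin orchestration layer: the query
# is dispatched to each platform's own infrastructure, and only aggregate
# partial answers cross the boundary. Every name here is hypothetical.

class PlatformEndpoint:
    """Stands in for computation running inside a VLOP/VLOSE's own systems."""
    def __init__(self, name, local_data):
        self.name = name
        self._local_data = local_data    # stays at the point of collection

    def run(self, query):
        # Only an aggregate partial answer leaves the platform.
        values = [row[query["field"]] for row in self._local_data]
        return {"count": len(values), "total": sum(values)}

class AccessPortal:
    """Holds no data; only dispatches approved queries and merges answers."""
    def __init__(self, endpoints):
        self._endpoints = endpoints

    def dispatch(self, query):
        partials = [ep.run(query) for ep in self._endpoints]
        count = sum(p["count"] for p in partials)
        total = sum(p["total"] for p in partials)
        return {"count": count, "mean": total / count}

portal = AccessPortal([
    PlatformEndpoint("vlop_a", [{"watch_minutes": 30}, {"watch_minutes": 50}]),
    PlatformEndpoint("vlop_b", [{"watch_minutes": 40}]),
])
print(portal.dispatch({"field": "watch_minutes"}))  # {'count': 3, 'mean': 40.0}
```

Because the portal only routes queries and merges aggregates, it needs no storage comparable to the platforms’ own, and each platform keeps its existing infrastructure; the orchestration layer integrates with it rather than replicating it.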