OpenMined was delighted to respond to the European Commission’s call for evidence on the contemplated delegated regulation on data access provided for in Article 40 of the DSA.
Our comments can be found in full here. A portion of them is reproduced below.
Our response focuses only on evidence and insights related to the operational execution of facilitating external access to internal systems. As a result, it addresses only a small number of your prompts. However, while our expertise is concentrated in this one specific area, we are in solidarity with several other submissions:
- European Tech Alliance (evidence)
- Centre for Democracy & Technology, Europe Office (evidence)
- Mozilla Foundation and co-signers (evidence)
Taken together, these and other comments already cover a great deal of your requests for evidence. We do not cover what these pieces of feedback already describe. We only provide brief extensions within our specific area of expertise – facilitating external access.
Our Extensions
The first extension we’d like to mention is one of framing. Modern PETs allow for a fundamental change in how external research is framed. In almost every article you will read — and possibly every conversation you have — people discuss external researchers obtaining “access to data”. This assumes that researchers need access to raw datasets in order to perform their analysis.
But data is merely a means to an end. What researchers really want is access to answers. They aren’t after 1 billion tweets; they want to know whether the tweet ranking is biased based on race or gender. They aren’t after ten million video uploads; they want to know whether social media video feeds are driving mental health issues in teenagers. They don’t want a database; they want the final histogram or table of metrics that goes into an academic paper, surfacing an important insight about society. Everything else is a means to that end.
Several PETs focus on facilitating the creation of (verified) answers without seeing the underlying data. The PETs industry hasn’t coalesced on a term for this yet — data spaces, federated learning, trusted research environments, trusted execution environments, data clean rooms, secure enclaves, secure multi-party computation, remote data science, and other terms all encompass this ideal — but we call this structured transparency. It’s also a central issue within the PETs reports mentioned above by the Royal Society, the United Nations, and the United States Government.
In short, structured transparency is about a new approach: researchers access answers instead of data. We have found that this approach has enormous implications for running external researcher programs. First and foremost, it strongly curtails privacy, security, and IP concerns. When an external researcher sees raw data from a VLOP, there’s almost no way to guarantee that they won’t upload it to the dark web, sell it on the side, or use it for purposes other than those they promised (in structured transparency jargon, this is called “the copy problem”). A researcher can sign every legal agreement under the sun, but in most cases — if they obtain raw data — preventing misuse is broadly unenforceable. If a researcher obtains access to underlying data, privacy, security, and IP risks are significant and legitimate. However, if the only thing a researcher ever acquires is a verified answer to a specific question they propose, then those risks are largely mitigated.
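To make the contrast concrete, the sketch below shows, in deliberately simplified Python, what an “answer-only” interface might look like: the raw records stay inside the platform’s environment, and the only thing that crosses the boundary to the researcher is the aggregate answer to their question. Every name here (TweetRecord, AnswerOnlyGateway, the bias-audit query) is a hypothetical illustration for this submission, not a real VLOP interface or an OpenMined API.

```python
# Illustrative toy only: an "answer-only" gateway, not a production system.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class TweetRecord:
    author_group: str   # e.g. a demographic group used in a bias audit
    rank_score: float   # the platform's internal ranking score


class AnswerOnlyGateway:
    """Holds raw records internally; releases only aggregate answers."""

    def __init__(self, records: List[TweetRecord]):
        self._records = records   # never exposed to the researcher

    def run_approved_query(
        self, query: Callable[[List[TweetRecord]], Dict[str, float]]
    ) -> Dict[str, float]:
        # The researcher's code runs inside the platform's environment;
        # only the aggregate result crosses the boundary.
        return query(self._records)


# The researcher's real question: "does average ranking differ across groups?"
def mean_rank_by_group(records: List[TweetRecord]) -> Dict[str, float]:
    scores: Dict[str, List[float]] = {}
    for r in records:
        scores.setdefault(r.author_group, []).append(r.rank_score)
    return {group: sum(vals) / len(vals) for group, vals in scores.items()}


gateway = AnswerOnlyGateway([
    TweetRecord("group_a", 0.81),
    TweetRecord("group_a", 0.77),
    TweetRecord("group_b", 0.62),
])
print(gateway.run_approved_query(mean_rank_by_group))
# prints the per-group averages -- the answer, never the tweets themselves
```

In a real deployment the query would of course be vetted before execution and the released answer subjected to output-privacy checks (see the sketch later in this piece), but the essential shift is visible even in this toy example: the researcher never holds a copy of the underlying records, so the copy problem never arises.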
Mitigating privacy, security, and IP concerns in this way can also have a profound impact on the project approval process. Instead of a VLOP weighing, “Am I going to show this person a bunch of my data — which they claim to want to see for reason X, but which might also expose risk Y?”, the VLOP simply decides, “Am I going to allow the answer to this question to be released?”. In a sense, this allows the conversation between a regulator and a VLOP to get straight to the point: it becomes more about whether a researcher’s question should be answered, and less about tradeoffs between answering those questions and the risk of misuse on the side.[1] This means project approvals can go faster and be subject to fewer back-and-forth subjective debates on nuanced access-vs-risk tradeoffs.
This also has significant implications for researcher accessibility. When access to data must happen directly, inquiry can be limited only to high-trust researchers within major academic or civil society institutions. Especially if travel to a secure facility is required, this can narrow the field to only those individuals with sufficient funding and pedigree to participate. However, if an external researcher does not obtain direct access to data, they can receive their answer remotely, lowering the need for a trust-based system and the costs that establishing that trust can require.
This also has significant implications for data expatriation. Several responses to this call for evidence draw attention to the risk of law enforcement using these interfaces to obtain data on European citizens and argue that policies should be enacted to prevent this. While navigating the nuances of law enforcement’s role in society is not OpenMined’s specialty, we do highlight that, if the external research platform only releases answers (instead of datasets), the risk that data will be re-used for other purposes is mitigated.
It also has significant implications for the cost of scaling such a program. With reduced reliance on background checks, secure (physical) facilities to which researchers must travel, and back-and-forth negotiations about whether the ends of a project justify the means of access to raw/sensitive data, the costs for all participants are reduced.
Finally, a system that facilitates access to answers instead of datasets is a win for all parties. For researchers, it means more of them can participate, they can get their answers faster, and their projects are more likely to be approved. For regulators, it means avoiding the moral dilemmas present when giving researchers access to mis-usable data, while supporting a richer ecosystem of algorithmic research — with more (and more diverse) participating entities. For VLOPs, it means avoiding risks to the privacy of their customers, the security of their platforms, and the IP present in their systems — while facilitating the creation of answers to important questions. This can help all stakeholders build and refine the best possible online platforms.
Naturally, this raises the question, “Is it possible for a researcher to generate a reliable, verified result without seeing the underlying data?”. While expounding on the full technical aspects of such a system is out of scope for this submission, for those who may be interested, we do offer a free, non-technical course on structured transparency that provides greater detail on the underlying technical principles that make this possible. To summarize, it is possible for an external researcher to prepare statistical / data science code, send that code to an online platform, and receive the result of that code back from the online platform with strong evidence that it was run in the way the external researcher desired. There are a tremendous number of details that facilitate that “verification” process amidst other challenging aspects of the interaction (e.g., data joins across multiple parties, protection against reverse-engineering outputs to learn about the data that generated them). While this submission is not the venue for such details, we are happy to provide more material on this matter if it is helpful. We have significant experience with this new paradigm — wherein external researchers generate verified results without seeing data — and we are confident in its utility for the Commission in relation to the data access program as envisaged by Article 40 of the DSA.
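For readers who want a feel for the mechanics, here is a deliberately minimal sketch of that flow under simplifying assumptions: the researcher submits analysis code, the platform executes it in its own environment, applies a crude output-privacy check, and returns the answer together with a signature binding it to a hash of the submitted code. The threshold, the HMAC-based signing, and all function names are illustrative stand-ins; real systems would rely on the stronger mechanisms referenced above, such as trusted execution environments and more rigorous output-privacy protections.

```python
# Minimal sketch (not a production protocol) of the "submit code, receive a
# verifiable answer" flow described above. All names and values are hypothetical.
import hashlib
import hmac
import json

PLATFORM_SIGNING_KEY = b"platform-secret"   # stand-in; real systems would use enclave/attestation keys
MIN_GROUP_SIZE = 100                        # crude output-privacy threshold


def run_researcher_code(code_text: str, dataset: list) -> dict:
    # The platform executes the researcher's analysis inside its own sandbox.
    # Here we simulate that step with a fixed aggregate result.
    return {"group_a_mean_watch_time": 41.2,
            "group_b_mean_watch_time": 55.7,
            "group_sizes": {"group_a": 5200, "group_b": 4800}}


def release_with_evidence(code_text: str, dataset: list) -> dict:
    result = run_researcher_code(code_text, dataset)

    # Output-privacy gate: refuse to release answers computed over tiny groups,
    # which could otherwise be reverse-engineered to reveal individuals.
    if any(n < MIN_GROUP_SIZE for n in result["group_sizes"].values()):
        raise ValueError("Result withheld: a group is too small to release safely")

    # Bind the released answer to a hash of the exact code that produced it,
    # then sign the bundle so the researcher has evidence of what was run.
    bundle = {"code_sha256": hashlib.sha256(code_text.encode()).hexdigest(),
              "result": result}
    payload = json.dumps(bundle, sort_keys=True).encode()
    signature = hmac.new(PLATFORM_SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return {"bundle": bundle, "signature": signature}


# Researcher side: submit analysis code, receive the answer plus evidence of execution.
submitted_code = "compute mean watch time per age group"   # stands in for real analysis code
response = release_with_evidence(submitted_code, dataset=[])
print(response["bundle"]["result"])
print("signature:", response["signature"][:16], "...")
```

The design choice worth noting is that the researcher verifies *what was run and what was released*, rather than trusting the platform with nothing or demanding the raw data; the specific cryptographic machinery for doing this robustly is exactly the kind of detail we are happy to expand on separately.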
We will finish by saying that we have witnessed these desirable properties in our own experience with structured transparency technologies. While these technologies are still emerging, they are at an appropriate level of maturity for facilitating external researcher access to internal algorithms and datasets, and they should be given serious consideration by the Commission in the process of establishing the conditions under which data is shared between researchers and VLOPs.
[1] This can still be a complex issue in some cases. Please see the structured transparency paper for important aspects, including those around input privacy, output privacy, and the broader governance of information. In short, sometimes a question can be posed which directly causes a privacy, security, or IP issue, such as “What is user X’s internet protocol address?”, “What is the password to this system?”, or “What is your company’s secret sauce?”. The point here is that structured transparency tools bring this debate into focus: the external and internal representatives talk directly about what questions should be answered, as opposed to being bogged down in questions of trust, data access, and the myriad ways a raw dataset could be subsequently misused.