Open data is the next (or already current) big thing. Is it enough?
We should be doing data science not (only) for the sake of having good models or nice predictions, but to provide decision makers with quantified, data-driven and assessed evidence.
Is a good data science process enough? I would say no. Whatever evidence data scientists are able to provide, it will be affected (or better, annotated) by uncertainty, risk, confidence intervals and variance. The role of the decision maker is not to accept blindly the outcome of the data science process but to weigh the risks and the costs properly.
Let us take an academic example: a doctor deciding whether or not to prescribe a treatment (e.g. chemotherapy) to a patient. It is not only about the potential success (and risk) of the treatment. It is also about the cost of a false positive (prescribing the treatment to a patient who only suffers its side effects) and of a false negative (withholding it and letting the patient's condition deteriorate).
Ultimately, it is the doctor who decides, on the basis of
- a model (implicit in the doctor's knowledge or made explicit, for instance in a statistical model)
- a measure of utility or cost (associated with false positives and false negatives)
whether or not it is more beneficial for the patient to receive the drug.
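The trade-off described here can be written as an expected-cost comparison. What follows is only a minimal sketch under strong assumptions: the model is reduced to a single probability that the patient needs the treatment, the two error costs are expressed on a common scale, and correct decisions cost nothing. The function names and the numbers in the usage note are hypothetical illustrations, not clinical values.

```python
def should_treat(p_needs_treatment: float, cost_fp: float, cost_fn: float) -> bool:
    """Return True when treating has a lower expected cost than not treating.

    p_needs_treatment: the model's estimated probability that the patient
        needs the treatment (assumed to be well calibrated).
    cost_fp: cost of a false positive (treating a patient who did not need it).
    cost_fn: cost of a false negative (not treating a patient who needed it).
    Assumption: correct decisions have zero cost.
    """
    if not 0.0 <= p_needs_treatment <= 1.0:
        raise ValueError("p_needs_treatment must be a probability in [0, 1]")
    # If we treat, we only pay when the patient did not need the treatment.
    expected_cost_treat = (1.0 - p_needs_treatment) * cost_fp
    # If we do not treat, we only pay when the patient did need it.
    expected_cost_no_treat = p_needs_treatment * cost_fn
    return expected_cost_treat < expected_cost_no_treat


def treatment_threshold(cost_fp: float, cost_fn: float) -> float:
    """Probability above which treating becomes the cheaper option."""
    return cost_fp / (cost_fp + cost_fn)
```

With a false negative assumed to be four times as costly as a false positive, `treatment_threshold(1.0, 4.0)` gives 0.2, so under these assumptions a patient with an estimated probability of 0.3 would be treated and one with 0.1 would not. The same model output leads to different decisions under different cost assumptions, which is why the cost function deserves to be documented alongside the model.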
If the data that led to the statistical model are (or presumably will be) open, and thus available in the future for reproducibility and scientific validation, what about the final choice of the doctor (or, more generally, of the decision maker)?
Decision making is either irrational or rational. In the first case, let us just cross our fingers. In the second case, it deserves a description, documentation and (why not) open sharing. I argue that, as with open data, a comparable (or greater) effort should be devoted to providing tools, repositories and dashboards to edit, store and disseminate open decision models describing
- the decision making setting (date, author, target, expected impact)
- the evidence it relied on (informal knowledge, literature, statistical models)
- in case of statistical evidence, the (open) data that were used for inferring it
- the utility (or cost) function used for the decision
- the decision making process, specifically how the evidence, the data and the utility function listed above were used to reach the final decision
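One way to picture such an open decision model is as a small structured record that can be stored and exchanged. The sketch below is an assumption-laden illustration, not a proposed standard: the class name `DecisionModel`, its field names and the JSON serialization are all hypothetical choices, and the example values are placeholders.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Dict, List


@dataclass
class DecisionModel:
    """Hypothetical record of one documented decision."""
    # Decision making setting
    date: str
    author: str
    target: str
    expected_impact: str
    # Evidence relied on (informal knowledge, literature, statistical models)
    evidence: List[str] = field(default_factory=list)
    # Identifiers of the (open) data behind any statistical evidence
    open_data: List[str] = field(default_factory=list)
    # Utility or cost function, here reduced to named cost values
    utility: Dict[str, float] = field(default_factory=dict)
    # How evidence, data and utility were combined into the decision
    process: str = ""

    def to_json(self) -> str:
        """Serialize the record so it can be stored or shared."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "DecisionModel":
        """Rebuild a record from its serialized form."""
        return cls(**json.loads(text))


# Placeholder values only, for illustration.
example = DecisionModel(
    date="<date>",
    author="<decision maker>",
    target="<patient or population>",
    expected_impact="<expected impact>",
    evidence=["<statistical model>", "<literature reference>"],
    open_data=["<dataset identifier>"],
    utility={"false_positive": 1.0, "false_negative": 4.0},
    process="<how the evidence and costs were combined>",
)
```

A record like this can be round-tripped with `DecisionModel.from_json(example.to_json())`, which is the minimal property a repository of decisions would need in order to store a decision today and reload it for review later.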
And what about confidentiality? The decision model (once formalized) could be kept confidential or given restricted access if needed. The issue here is not really the disclosure of sensitive information but rather the degree of reproducibility of a decision. We can only learn from our own (or others') errors. Think about political decision makers, democratically required to document and safely store their decisions, and the possibility for a citizen to rerun those decisions (once disclosed) in the near (or far) future.