A 2016 survey on ethics in NLP by Fort and Couillault noted that 44.5% of NLP researchers do not consider themselves responsible for the potential uses of the tools they create. How do the differences between researchers who hold themselves responsible and those who do not manifest in tangible terms? One difference might be the readiness to detect and mitigate biases in their models. This leads us to the question: should the entire burden of probing models for potential biases lie on the shoulders of the coterie working on bias and fairness? Clearly, we do not consider the evaluation of our models to be the responsibility of researchers working exclusively on evaluation, so why not broaden our evaluation to also include a discussion of potential biases, their sources, and potential remedies? For this to become a widespread practice, there needs to be an established norm for evaluating ethical biases. A few notable contributions to ACL 2020 take a step in this direction by proposing guidelines and formalisms for NLP researchers and practitioners.
Let’s start by thinking about what should be built in the first place. When we talk about our models being biased, the implied assumption is that a model exists. In a keynote address at ACL 2020, Dr. Rachael Tatman spoke about what should not be built, the underlying premise being that no system is inevitable: someone has to build it. Tatman suggests that each of us should take time to reflect on which technologies we should not contribute to. She offers a list of considerations: Who is harmed by, and who benefits from, the existence of the system? Can the users of the technology opt out of its use? Does the system amplify existing inequalities? And, more generally, is building the system the best use of our resources to better the world? Her personal list of systems she won’t build includes (a) surveillance technologies (e.g. face recognition models, as they could be misused for targeting specific demographics); (b) deception technologies (e.g. bots posing as humans); and (c) social category detectors that identify the race, gender, nationality, etc. of individuals without their consent. It might be a good exercise to pause and enumerate the systems that you wouldn’t build.
It is also possible that some legitimate systems nevertheless violate these principles. One way of unintentionally causing harm is through dual use, i.e. a system developed for one purpose can be abused for another. For example, stylometric analysis can shed light on the provenance of historical texts (as in Mosteller and Wallace’s study of the disputed Federalist Papers), but it also endangers the anonymity of political dissenters. When the dual use concern is extended ad infinitum, it can result in slippery slope arguments, e.g. that the researchers perfecting object detection are in part responsible for surveillance technologies, or, stretching it even further, that early mathematicians are to blame for all the misuses of technology today. Still, researchers must reasonably anticipate the direct misapplications of the systems they build, and not leave all the ethical concerns to regulatory policies that are often left to play catch-up.
Leins, Lau, and Baldwin study what uses of NLP are appropriate, and on what basis. To assess a work on ethical grounds, they consider (a) data ethics: the data source and the procedure used to obtain the data; and (b) dual use. As a case study, they consider a recent work by Chen et al. that uses a neural model to predict prison terms from the case description and the past charges laid against an individual. For the dimension of data ethics, they consider issues concerning the privacy of the individuals, unfair (dis)advantages to social groups, the appropriateness of the content, the demographic characteristics of the annotators, and finally, how frequently the dataset will be updated. Although the authors of the prison sentencing work highlight potential ethical concerns with respect to the adoption of the model, they fail to consider data ethics. Leins et al. point out that although the first names of the defendants are masked, other information such as last names and location references renders the defendants identifiable. This information can put victims, defendants, and their families in harm’s way. Furthermore, the dataset is frozen in time. This could lead to situations where legal cases that are ultimately annulled are preserved in their original form, plausibly implying the guilt of some innocent individuals. For the prison sentencing research under consideration, the dimension of dual use is a moot point, as its primary use itself seems problematic.
The authors further question the use of such algorithms even when they merely inform the Supreme Court rather than automate decision-making. They consider how much weight should be given to the system, and whether biases in the system could lead to inequities in sentencing. They conclude that such models should not be used at all.
While not using algorithms to inform high-stakes court decisions seems to be the safer bet, it comes with a huge opportunity cost: acceptance of the status quo, i.e. humans making the decisions. A study by Kleinberg et al. shows that algorithmically identifying high-risk individuals using machine learning can reduce pre-trial jailing rates by over 40% with no increase in crime. Further, they show that judges effectively under-weight key observable variables such as prior criminal record.
People are inherently biased. Even though in day-to-day life we report experiencing emotions like “anger”, “sadness”, and “happiness”, neuroscientific evidence has failed to yield consistent support for the existence of such discrete categories of emotion (Barrett et al.). Courtroom judgements are not devoid of emotional, cultural, and social biases. Since not everyone experiences, expresses, and interprets emotions in the same way, minorities are at an even greater risk of being misunderstood and, hence, discriminated against.
A 2017 study reveals that judges routinely make sentencing decisions based on their reading of how a defendant will behave once released, taking cues from whether a defendant “shows remorse”. There is little evidence that humans can evaluate remorse accurately on the basis of facial expressions or other non-verbal indications. Further, if the defendant is innocent, should they be remorseful in the first place?
Lastly, whether and to what extent we should use algorithms to inform the justice system is not a futuristic discussion. On the one hand, a predictive policing technology, PredPol, is used by more than 60 police departments in the USA; on the other hand, the French government has made it illegal to algorithmically analyse any decision made by a judge. The choice is never straightforward. Personally, we are optimistic about machine learning models assisting stakeholders in high-stakes tasks; however, a lot of work needs to be done to make that possible.
 Fort, Karën, and Alain Couillault. “Yes, We Care! Results of the Ethics and Natural Language Processing Surveys.” International Language Resources and Evaluation Conference (LREC). 2016.
 Mosteller, Frederick, and David L. Wallace. “Inference in an authorship problem: A comparative study of discrimination methods applied to the authorship of the disputed Federalist Papers.” Journal of the American Statistical Association 58.302 (1963): 275-309.
 Leins, Kobi, Jey Han Lau, and Timothy Baldwin. “Give Me Convenience and Give Her Death: Who Should Decide What Uses of NLP are Appropriate, and on What Basis?” Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020.
 Huajie Chen, Deng Cai, Wei Dai, Zehui Dai, and Yadong Ding. 2019. Charge-based prison term prediction with deep gating network. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6361–6366, Hong Kong, China.
 Kleinberg, J., Lakkaraju, H., Leskovec, J., Ludwig, J., & Mullainathan, S. (2018). Human decisions and machine predictions. The quarterly journal of economics, 133(1), 237-293.
 Barrett, L. F., Lindquist, K., Bliss-Moreau, E., Duncan, S., Gendron, M., Mize, J., & Brennan, L. (in press). Of mice and men: Natural kinds of emotion in the mammalian brain? Perspectives on Psychological Science.
 Bandes, Susan A. “Remorse and criminal justice.” Emotion Review 8.1 (2016): 14-19.