Crowd Embeddings
Type: Case Study
Datasets: Wikipedia edits
Technique: Supervised machine learning, word embeddings
Developed by: Jigsaw
Case Study: Perspective API
Perspective API (https://www.perspectiveapi.com/) is a machine learning tool developed by the Google-owned company Jigsaw, which aims to identify toxic messages in the comment sections of various platforms. The project has worked in collaboration with Wikipedia, The New York Times, The Guardian, and The Economist.
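To make the tool concrete, here is a minimal sketch of querying Perspective API for a toxicity score, following the shape of its public REST endpoint. The API key is a hypothetical placeholder; you would need to request access through Google Cloud.

```python
# A minimal sketch of requesting a TOXICITY score from Perspective API.
# API_KEY is a hypothetical placeholder; the endpoint and request body
# follow the publicly documented commentanalyzer API.
import json
import urllib.request

API_KEY = "YOUR_API_KEY"  # hypothetical placeholder
URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       f"comments:analyze?key={API_KEY}")

payload = {
    "comment": {"text": "You are a wonderful collaborator."},
    "requestedAttributes": {"TOXICITY": {}},
}

request = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    result = json.load(response)

# The summary score is a probability-like value between 0 and 1.
score = result["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
print(f"Toxicity score: {score:.3f}")
```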
The collaboration between Perspective API and Wikipedia lives under the name Detox (https://meta.wikimedia.org/wiki/Research:Detox). The project is based on a method that combines crowdsourcing and machine learning to analyse personal attacks at scale. Two intentions seem to be at stake: research into harassment on Wikipedia Talk pages, and the creation of the largest annotated harassment dataset.
The project uses supervised machine learning techniques, a logistic regression algorithm, and two datasets (a sketch of this classification setup follows the list below):
- 95M comments from English Wikipedia talk pages, posted between 2001 and 2015
- 1M annotations by 4,000 crowd workers on 100,000 comments from English Wikipedia talk pages, where each comment is annotated 10 times.
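The sketch below shows one plausible version of this setup, not the Detox code itself: crowd annotations are aggregated by majority vote into a per-comment label, and a logistic regression classifier is trained on n-gram features. The CSV file and its column names (rev_id, comment, attack) are hypothetical stand-ins for the released data.

```python
# A minimal sketch, assuming a hypothetical CSV with one row per
# (comment, crowd annotation) pair: rev_id, comment, attack (0/1).
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

annotations = pd.read_csv("attack_annotations.csv")  # hypothetical file

# Each comment is annotated 10 times; take the majority vote as its label.
labels = annotations.groupby("rev_id")["attack"].mean() > 0.5
comments = annotations.groupby("rev_id")["comment"].first()

X_train, X_test, y_train, y_test = train_test_split(
    comments, labels, test_size=0.2, random_state=0)

# N-gram features feed a logistic regression classifier, the algorithm
# named by the project.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=10000)
clf = LogisticRegression(max_iter=1000)
clf.fit(vectorizer.fit_transform(X_train), y_train)

print("Held-out accuracy:", clf.score(vectorizer.transform(X_test), y_test))
```

Majority voting is only one way to collapse the 10 annotations per comment; the spread of annotator disagreement is itself part of what the project studies.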
Findings: This leads to several interesting findings: while anonymously contributed comments are 6 times more likely to be an attack, they contribute less than half of the attacks. Similarly, less than half of attacks come from users with little prior participation; and, perhaps surprisingly, approximately 30% of attacks come from registered users with over 100 contributions.
Moreover, the crowdsourced data may also result in other forms of unintended bias.
This brings up key questions for our method and more generally for applications of machine learning to analysis of comments: who defines the truth for the property in question? How much do classifiers vary depending on who is asked? What is the subsequent impact of applying a model with unintended bias to help an online community’s discussion?
The Detox project includes a section on biases, published under the name "Fairness" (https://meta.wikimedia.org/wiki/Research:Detox/Fairness).