Crowd Embeddings
Type: Case Study
Datasets: Wikipedia edits
Technique: Supervised machine learning, word embeddings
Developed by: Jigsaw
Case Study: Perspective API
Perspective API (https://www.perspectiveapi.com/) is a machine learning tool developed by the Google-owned company Jigsaw, which aims to identify toxic messages in the comment sections of various platforms. The project has worked in collaboration with Wikipedia, The New York Times, The Guardian, and The Economist.
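To make the tool concrete, here is a minimal sketch of querying Perspective API for a toxicity score, following the shape of its public REST endpoint. The API key is a hypothetical placeholder; you would need to request access through Google Cloud.

```python
# A minimal sketch of requesting a TOXICITY score from Perspective API.
# API_KEY is a hypothetical placeholder; the endpoint and request body
# follow the publicly documented commentanalyzer API.
import json
import urllib.request

API_KEY = "YOUR_API_KEY"  # hypothetical placeholder
URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       f"comments:analyze?key={API_KEY}")

payload = {
    "comment": {"text": "You are a wonderful collaborator."},
    "requestedAttributes": {"TOXICITY": {}},
}

request = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    result = json.load(response)

# The summary score is a probability-like value between 0 and 1.
score = result["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
print(f"Toxicity score: {score:.3f}")
```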
The collaboration between Perspective API and Wikipedia lives under the name Detox (https://meta.wikimedia.org/wiki/Research:Detox). The project is based on a method that combines crowdsourcing and machine learning to analyse personal attacks at scale. Two intentions seem to be at stake: research into harassment on Wikipedia Talk pages, and the creation of the largest annotated harassment dataset.
The project uses supervised machine learning techniques, a logistic regression algorithm, and two datasets (a sketch of this classification setup follows the list below):
- 95M comments from English Wikipedia talk pages, posted between 2001 and 2015
- 1M annotations by 4,000 crowd workers on 100,000 comments from English Wikipedia talk pages, where each comment is annotated 10 times.
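The sketch below shows one plausible version of this setup, not the Detox code itself: crowd annotations are aggregated by majority vote into a per-comment label, and a logistic regression classifier is trained on n-gram features. The CSV file and its column names (rev_id, comment, attack) are hypothetical stand-ins for the released data.

```python
# A minimal sketch, assuming a hypothetical CSV with one row per
# (comment, crowd annotation) pair: rev_id, comment, attack (0/1).
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

annotations = pd.read_csv("attack_annotations.csv")  # hypothetical file

# Each comment is annotated 10 times; take the majority vote as its label.
labels = annotations.groupby("rev_id")["attack"].mean() > 0.5
comments = annotations.groupby("rev_id")["comment"].first()

X_train, X_test, y_train, y_test = train_test_split(
    comments, labels, test_size=0.2, random_state=0)

# N-gram features feed a logistic regression classifier, the algorithm
# named by the project.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=10000)
clf = LogisticRegression(max_iter=1000)
clf.fit(vectorizer.fit_transform(X_train), y_train)

print("Held-out accuracy:", clf.score(vectorizer.transform(X_test), y_test))
```

Majority voting is only one way to collapse the 10 annotations per comment; the spread of annotator disagreement is itself part of what the project studies.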
Findings: This leads to several interesting findings: while anonymously contributed comments are 6 times more likely to be an attack, they contribute less than half of the attacks. Similarly, less than half of attacks come from users with little prior participation; and, perhaps surprisingly, approximately 30% of attacks come from registered users with over 100 contributions.
Moreover, the crowdsourced data may also result in other forms of unintended bias.
This brings up key questions for our method and more generally for applications of machine learning to analysis of comments: who defines the truth for the property in question? How much do classifiers vary depending on who is asked? What is the subsequent impact of applying a model with unintended bias to help an online community’s discussion?
The Detox project includes a section on biases, published under the name "Fairness" (https://meta.wikimedia.org/wiki/Research:Detox/Fairness).