Data Labeling using Weak Supervision: In Action

In this blog post, I will share my takeaways and results from using Weak Supervision to label comments in Jigsaw’s dataset as toxic or non-toxic. In my previous blog post, I discussed in detail how human annotations are not only expensive to obtain but also not always straightforward to use reliably when training machine learning applications. Moreover, annotators may have to repeatedly read and analyze comments targeted at an underrepresented group they identify with, which can take a toll on their well-being. Give it a read to learn more about why this problem motivates me and why I seek to explore Weak Supervision for labeling this data at scale.

In the rest of this post, I will detail the end-to-end workflow -

[Image: the end-to-end Weak Supervision workflow]

I used 90% of the Jigsaw English train set as my train set. Note that I will not be using its ground-truth labels for training: the idea is to label these data points programmatically using Weak Supervision, so I treat this split as an unlabeled dataset. For the remaining 10%, I do use the labels – 5% forms the validation set and 5% the sequestered test set, which I will evaluate on only once at the very end. My train set has a little over 200K comments.
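Here is a minimal sketch of how this split might be set up with pandas and scikit-learn. The file name and column names (comment_text, toxic) are assumptions about the Jigsaw CSV, and the comment text is renamed to a text column so that the labeling functions below can access it as x.text.

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the Jigsaw train set; file name and column names are assumptions.
df = pd.read_csv("train.csv").rename(columns={"comment_text": "text"})

# 90% train (treated as unlabeled), 5% validation, 5% sequestered test.
df_train, df_holdout = train_test_split(df, test_size=0.10, random_state=42, stratify=df["toxic"])
df_valid, df_test = train_test_split(df_holdout, test_size=0.50, random_state=42, stratify=df_holdout["toxic"])

# Ground-truth labels are only kept for the validation and test splits.
Y_valid = df_valid["toxic"].values
Y_test = df_test["toxic"].values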

I used Snorkel, an open-source library for Weak Supervision from the Hazy Research Lab at Stanford, for this project.

Developing Labeling Functions

In my previous blog post, I described Labeling Functions (LFs) in detail and also shared my initial thoughts on what they would look like for the problem of labeling for toxicity.

Here I describe some LFs that capture heuristics a human annotator might use to determine whether a comment is toxic. Each LF takes a text input (a comment) and, based on the logic defined, returns a label (toxic or non-toxic) or simply abstains from labeling. I experimented with various kinds of LFs -

Pre-trained Models

If a comment mentions titles of books, songs, or other pieces of known art, I label it non-toxic, since this is a signal that the comment is specific to a conversation topic and less likely to be toxic. With similar intuition, if a comment contains at least 3 mentions of named entities, I label it non-toxic. For both of these rules, I use SpaCy’s pre-trained Named Entity Recognition model.

from snorkel.labeling import labeling_function
from snorkel.preprocess.nlp import SpacyPreprocessor

# Label constants: LFs vote non-toxic (0), toxic (1), or abstain (-1).
ABSTAIN = -1
NONTOXIC = 0
TOXIC = 1

spacy = SpacyPreprocessor(text_field="text", doc_field="doc", memoize=True)

@labeling_function(pre=[spacy])
def contains_work_of_art(x):
    """If comment contains titles of books, songs, etc., label non-toxic, else abstain"""
    if any(ent.label_ == "WORK_OF_ART" for ent in x.doc.ents):
        return NONTOXIC
    else:
        return ABSTAIN

@labeling_function(pre=[spacy])
def contains_entity(x):
    """If comment contains at least 3 mentions of named entities, label non-toxic, else abstain"""
    if len([ent for ent in x.doc.ents if ent.label_ in ["PERSON", "GPE", "LOC", "ORG", "LAW", "LANGUAGE"]]) > 2:
        return NONTOXIC
    else:
        return ABSTAIN

In another pair of LFs, I use TextBlob’s pre-trained sentiment analysis model. The intuition here is that non-toxic comments are likely to have higher polarity (+1 is positive, -1 is negative) and high subjectivity scores.

from snorkel.preprocess import preprocessor
from textblob import TextBlob

@preprocessor(memoize=True)
def textblob_sentiment(x):
    """Attach TextBlob polarity and subjectivity scores to the data point."""
    scores = TextBlob(x.text)
    x.polarity = scores.sentiment.polarity
    x.subjectivity = scores.sentiment.subjectivity
    return x

@labeling_function(pre=[textblob_sentiment])
def textblob_polarity(x):
    """If comment has a polarity score between +0.9 and +1, label non-toxic, else abstain"""
    return NONTOXIC if x.polarity > 0.9 else ABSTAIN

@labeling_function(pre=[textblob_sentiment])
def textblob_subjectivity(x):
    """If comment has a subjectivity score between +0.7 and +1, label non-toxic, else abstain"""
    return NONTOXIC if x.subjectivity >= 0.7 else ABSTAIN

I used an open-source library, better_profanity, to detect swear words and their various leetspeak versions in comments. Intuitively, profanity is likely to make someone uncomfortable enough to leave the online discussion (which is how Jigsaw defines toxicity).

from better_profanity import profanity
@labeling_function()
def contains_profanity(x):
    """
    If comment contains profanity label toxic, else abstain. 
    Profanity determined using this library - https://github.com/snguyenthanh/better_profanity
    """
    return TOXIC if profanity.contains_profanity(x.text) else ABSTAIN

Pattern Matching

I observed a few commonly occurring phrases, particularly in the non-toxic comments. I wrote LFs that can match many variations of these phrases using SpaCy’s rule-based Matcher, and I’ve included example phrases in the docstrings -

from snorkel.preprocess.nlp import SpacyPreprocessor
from spacy.matcher import Matcher

spacy = SpacyPreprocessor(text_field="text", doc_field="doc", memoize=True)

@labeling_function(pre=[spacy])
def contains_pleaseread(x):
    """
    Will match commonly occurring phrases like -
    Please read this
    Please read the
    Please read
    """
    matcher = Matcher(x.doc.vocab)
    pattern = [{"LEMMA": "please"},
               {"LEMMA": "read"},
               {"LEMMA": "the", "OP": "?"},
               {"LEMMA": "this", "OP": "?"}]
    matcher.add("p1", None, pattern)
    matches = matcher(x.doc)
    return NONTOXIC if len(matches)>0 else ABSTAIN

@labeling_function(pre=[spacy])
def contains_stopvandalizing(x):
    """
    Will match commonly occurring phrases like -
    stop vandalizing
    do not vandalize
    don't vandalize
    """
    matcher = Matcher(x.doc.vocab)
    pattern1 = [{"LEMMA": "do"},
                {"LEMMA": "not"},
                {"LEMMA": "vandalize"}]
    pattern2 = [{"LEMMA": "stop"}, 
                {"LEMMA": "vandalize"}]
    matcher.add("p1", None, pattern1)
    matcher.add("p2", None, pattern2)
    matches = matcher(x.doc)
    return NONTOXIC if len(matches)>0 else ABSTAIN
    
@labeling_function(pre=[spacy])
def contains_harassme(x):
    """
    Will match commonly occurring phrases like -
    harass me
    harassed me
    harassing me
    """
    matcher = Matcher(x.doc.vocab)
    pattern = [{"LOWER": "harass"}, 
               {"LOWER": "me"}]
    matcher.add("p1", None, pattern)
    matches = matcher(x.doc)
    return NONTOXIC if len(matches)>0 else ABSTAIN

@labeling_function(pre=[spacy])
def contains_willreport(x):
    """Will match commonly observed phrases like - 
    report you
    reported you
    reporting you
    reported your
    """
    matcher = Matcher(x.doc.vocab)
    pattern = [{"LEMMA": "report"}, 
               {"LEMMA": "you"}]
    matcher.add("p1", None, pattern)
    matches = matcher(x.doc)
    return NONTOXIC if len(matches)>0 else ABSTAIN

I looked for URLs and email addresses in comments and, if found, labeled the comment non-toxic. The intuition here is that such comments are likely to be informative (pointing readers to more information or a way to reach out). They might also indicate spam (ads or self-promoting businesses), but overall they are less likely to offend anyone through toxicity.

@labeling_function(pre=[spacy])
def contains_email(x):
    """If comment contains email address, label non-toxic, else abstain"""
    matcher = Matcher(x.doc.vocab)
    pattern = [{"LIKE_EMAIL": True}]
    matcher.add("p1", None, pattern)
    matches = matcher(x.doc)
    return NONTOXIC if len(matches)>0 else ABSTAIN
    
@labeling_function(pre=[spacy])
def contains_url(x):
    """If comment contains url, label non-toxic, else abstain"""
    matcher = Matcher(x.doc.vocab)
    pattern = [{"LIKE_URL": True}]
    matcher.add("p1", None, pattern)
    matches = matcher(x.doc)
    return NONTOXIC if len(matches)>0 else ABSTAIN

Keyword Searches

I also looked for the use of profanity in a comment using an external knowledge base found here. The better_profanity library consumes part of this list, so these 2 LFs might be highly correlated. Thinking along similar lines, I look for words like “thank you” and “please” (and their variations) and label those comments non-toxic, as they are indications of civil conversation (more often than sarcastic use).

from snorkel.labeling import LabelingFunction

def keyword_lookup(x, keywords, label):
    """Return `label` if the comment contains any of the keywords (substring match), else abstain."""
    if any(word in x.text.lower() for word in keywords):
        return label
    return ABSTAIN

def make_keyword_lf(keywords, label=TOXIC):
    return LabelingFunction(
        name=f"keyword_{keywords[0]}",
        f=keyword_lookup,
        resources=dict(keywords=keywords, label=label),
    )

# Comments mentioning at least one of Google's toxic stopwords are likely toxic.
# https://code.google.com/archive/p/badwordslist/downloads
with open('../../../Downloads/public_datasets/badwords.txt') as f:
    toxic_stopwords = f.readlines()

toxic_stopwords = [x.strip() for x in toxic_stopwords]  # len = 458
keyword_toxic_stopwords = make_keyword_lf(keywords=toxic_stopwords)

keyword_pl = make_keyword_lf(keywords=["please", "plz", "pls", "pl"], label=NONTOXIC)

keyword_thanks = make_keyword_lf(keywords=["thanks", "thank you", "thx", "tx"], label=NONTOXIC)

Miscellaneous

I also found that comments written in all caps were often toxic -

@labeling_function()
def capslock(x):
    """If comment is written in all caps, label toxic, else abstain"""
    return TOXIC if x.text == x.text.upper() else ABSTAIN

Next, I applied these LFs to the train set to obtain the Label Matrix for the train set. It’s a NumPy array with one column per LF and one row per data point, and it is what we need to train the generative model in the next step. I used Snorkel’s LFAnalysis utility, which summarizes the coverage, overlaps, and conflicts between these LFs and helps get a sense of how they are doing. Since we’re assuming we don’t have labels for these comments, we cannot compute empirical accuracies at this stage.
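A minimal sketch of this step with Snorkel’s PandasLFApplier, assuming the DataFrame names from the split sketch above:

from snorkel.labeling import PandasLFApplier, LFAnalysis

lfs = [contains_work_of_art, contains_entity, textblob_polarity, textblob_subjectivity,
       contains_profanity, contains_pleaseread, contains_stopvandalizing, contains_harassme,
       contains_willreport, contains_email, contains_url, keyword_toxic_stopwords,
       keyword_pl, keyword_thanks, capslock]

# Build the label matrix: one row per comment, one column per LF, entries in {-1, 0, 1}.
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)

# Coverage / overlap / conflict summary (no gold labels here, so no accuracy columns).
LFAnalysis(L=L_train, lfs=lfs).lf_summary()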

As a reminder, LFs are expected to be noisy, conflicting, and potentially correlated. In the table below, Polarity is the set of labels an LF emits (0 = non-toxic, 1 = toxic), Coverage is the fraction of comments it labels, Overlaps is the fraction of comments where it and at least one other LF both label, and Conflicts is the fraction where its label disagrees with another LF’s -

| LF | j | Polarity | Coverage | Overlaps | Conflicts |
| --- | --- | --- | --- | --- | --- |
| contains_work_of_art | 0 | [0] | 0.08050935913 | 0.07522590137 | 0.04089088144 |
| contains_entity | 1 | [0] | 0.345457618 | 0.2762507828 | 0.1517291768 |
| textblob_polarity | 2 | [0] | 0.004607493265 | 0.003598516854 | 0.0008996292136 |
| textblob_subjectivity | 3 | [0] | 0.1170909669 | 0.07056870483 | 0.03657663747 |
| contains_profanity | 4 | [1] | 0.08198057596 | 0.07962961122 | 0.05119436961 |
| contains_pleaseread | 5 | [0] | 0.004160163822 | 0.004160163822 | 0.001381750947 |
| contains_stopvandalizing | 6 | [0] | 0.005179080887 | 0.004880861258 | 0.0005268546776 |
| contains_harassme | 7 | [0] | 0.0002435460302 | 0.0002435460302 | 0.0002435460302 |
| contains_willreport | 8 | [0] | 1.99E-05 | 1.99E-05 | 1.49E-05 |
| contains_email | 9 | [0] | 0.00224658787 | 0.001998071513 | 0.001068620337 |
| contains_url | 10 | [0] | 0.03810749824 | 0.03216298697 | 0.01740111534 |
| keyword_toxic_stopwords | 11 | [1] | 0.2880105769 | 0.2505840134 | 0.2215821545 |
| keyword_please | 12 | [0] | 0.3814179349 | 0.2874191079 | 0.1458443095 |
| keyword_thanks | 13 | [0] | 0.1177719017 | 0.08879986481 | 0.03878843305 |
| capslock | 14 | [1] | 0.01294273189 | 0.008543992366 | 0.005193991869 |

Here is a graph to get a sense of the overall coverage of these LFs together.

[Image: overall coverage of the LFs on the train set]

Our next step is to convert these numerous labels into a final set of probabilistic noise-aware labels.

Training a Generative Model Using Label Matrix

This Label Matrix is all we need to train a generative model that outputs a final, single set of probabilistic labels. The model estimates the accuracy of each LF, accounts for potential correlations between them, and factors in how often each LF labels versus abstains. Note that no gold labels are used during the training process: Snorkel’s LabelModel learns weights for the labeling functions using only the label matrix as input.

Before training the generative model, let’s briefly discuss the baseline method for combining LF outputs, MajorityLabelVoter. Under a majority vote, if more LFs voted “toxic” than “non-toxic”, we treat the comment as toxic, and vice versa. Ideally, the LFs should not be treated identically – as we saw in my previous post, they may be correlated, and with a majority vote some signals can be overrepresented. Among the LFs defined above, better_profanity and the keyword-based profanity lookup are likely to label a large number of the same data points. More generally, we need to denoise the LFs, and MajorityLabelVoter does not help with that.
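Here is a minimal sketch of fitting both models on the label matrix; the LabelModel hyperparameters (epochs, learning rate, seed) are illustrative rather than the exact values I used.

from snorkel.labeling.model import LabelModel, MajorityLabelVoter

# Baseline: simple (unweighted) majority vote over the LF outputs.
majority_model = MajorityLabelVoter(cardinality=2)
preds_majority = majority_model.predict(L=L_train)

# Generative model: learns LF accuracies/weights from the label matrix alone (no gold labels).
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train=L_train, n_epochs=500, lr=0.001, log_freq=100, seed=123)

# Probabilistic, noise-aware labels for the train set.
probs_train = label_model.predict_proba(L=L_train)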

The LabelModel is able to denoise the LFs, estimating their accuracies and weighting them accordingly, to output a single set of noise-aware, confidence-weighted labels. Notice in the graph below that the labels are probabilistic in nature.

[Image: probabilistic labels output by the LabelModel]

There will still be many comments that no LF labels. In this case, 73.25% of the train set is labeled by one or more LFs. We will use these labels to train a discriminative model for toxic comment classification.
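Comments where every LF abstained can be dropped before training the downstream model; here is a short sketch using Snorkel’s filter_unlabeled_dataframe, assuming the variables from the snippets above:

from snorkel.labeling import filter_unlabeled_dataframe

# Keep only the ~73% of training comments that received at least one LF vote.
df_train_filtered, probs_train_filtered = filter_unlabeled_dataframe(
    X=df_train, y=probs_train, L=L_train
)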

At this point, we can also apply LFs to the validation set and inspect the coverage and empirical accuracies associated with each LF. Based on these we can rethink some of our LFs, and re-apply them to the train set before we train a discriminative model. It’s important to keep in mind a good balance of coverage and accuracy. An LF should label as much of the data as possible, and as accurately as possible.
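A sketch of this step, reusing the applier and the validation labels from earlier; lf_summary can take gold labels to add the Correct, Incorrect, and Emp. Acc. columns shown below:

# Apply the LFs to the validation set and summarize them against its gold labels.
L_valid = applier.apply(df=df_valid)
LFAnalysis(L=L_valid, lfs=lfs).lf_summary(Y=Y_valid)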

| LF | j | Polarity | Coverage | Overlaps | Conflicts | Correct | Incorrect | Emp. Acc. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| contains_work_of_art | 0 | [0] | 0.08185721954 | 0.07711576311 | 0.04195741635 | 877 | 38 | 0.9584699454 |
| contains_entity | 1 | [0] | 0.3520307747 | 0.2842190016 | 0.1549472177 | 3705 | 230 | 0.9415501906 |
| textblob_polarity | 2 | [0] | 0.003936303453 | 0.003220611916 | 0.0006262300948 | 41 | 3 | 0.9318181818 |
| textblob_subjectivity | 3 | [0] | 0.1148684917 | 0.07058507783 | 0.03399534801 | 1016 | 268 | 0.7912772586 |
| contains_profanity | 4 | [1] | 0.08427267848 | 0.08248344963 | 0.05457147969 | 588 | 354 | 0.6242038217 |
| contains_pleaseread | 5 | [0] | 0.005457147969 | 0.005457147969 | 0.001162998748 | 59 | 2 | 0.9672131148 |
| contains_stopvandalizing | 6 | [0] | 0.005546609411 | 0.005188763643 | 0.0005367686527 | 59 | 3 | 0.9516129032 |
| contains_harassme | 7 | [0] | 0.0003578457685 | 0.0003578457685 | 0.0003578457685 | 2 | 2 | 0.5 |
| contains_willreport | 8 | [0] | 8.95E-05 | 8.95E-05 | 0 | 0 | 1 | 0 |
| contains_email | 9 | [0] | 0.003310073358 | 0.003131150474 | 0.001431383074 | 36 | 1 | 0.972972973 |
| contains_url | 10 | [0] | 0.0397208803 | 0.03453211666 | 0.01851851852 | 424 | 20 | 0.954954955 |
| keyword_toxic_stopwords | 11 | [1] | 0.2912864555 | 0.2536231884 | 0.2252639113 | 730 | 2526 | 0.2242014742 |
| keyword_please | 12 | [0] | 0.3956879585 | 0.2964752192 | 0.152979066 | 4158 | 265 | 0.9400859145 |
| keyword_thanks | 13 | [0] | 0.115136876 | 0.08400429415 | 0.03685811415 | 1261 | 26 | 0.9797979798 |
| capslock | 14 | [1] | 0.01315083199 | 0.008677759885 | 0.005546609411 | 81 | 66 | 0.5510204082 |

From the table above, we can see that some LFs worked a lot better than others. contains_entity has pretty good coverage of 35% and high accuracy of 94%. keyword_thanks has the highest accuracy (97.9%) on the validation set, with around 11% coverage. contains_pleaseread has low coverage but very good accuracy. It seems like we can discard contains_willreport from our pool of LFs in future iterations, as it has extremely low coverage (and got its only validation vote wrong). keyword_toxic_stopwords, which has high coverage, gets more labels incorrect than correct, so we might need to further inspect this list of words for future iterations.

The output from the LabelModel we trained has a 74.8% accuracy on the validation set of 11K comments.
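For reference, this kind of number can be computed directly with the LabelModel’s scoring helper; breaking ties randomly here is my assumption about the evaluation setup:

# Accuracy of the LabelModel's outputs against the validation labels.
label_model.score(L=L_valid, Y=Y_valid, metrics=["accuracy"], tie_break_policy="random")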

Training a Discriminative Model Using Probabilistic Labels

Finally, we use these probabilistic labels to train a binary classifier for toxic comment classification. This step is important because it allows us to generalize beyond what has been labeled by the labeling functions. The discriminative model (or its training setup) needs to be able to support probabilistic labels.

I trained a Logistic Regression model using bag of n-grams features and saw an accuracy of 90.5% on the validation set of 11K comments.
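The exact training code isn’t included here, but below is a minimal sketch of one way to do this with scikit-learn. It converts the probabilistic labels to hard labels with Snorkel’s probs_to_preds (using them as sample weights is another option), and the n-gram range is an assumption.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from snorkel.utils import probs_to_preds

# Bag of n-grams features over the filtered (LF-labeled) train set.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(df_train_filtered["text"])
X_valid = vectorizer.transform(df_valid["text"])

# Collapse the probabilistic labels to hard labels for a standard classifier.
preds_train = probs_to_preds(probs=probs_train_filtered)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, preds_train)
print("Validation accuracy:", clf.score(X_valid, Y_valid))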

Future Work

A lot more can be done with labeling functions. Here are a couple of other ideas and resources if readers would like to extend this work -

I would personally like to learn more about this domain and write a larger number of rich labeling functions, as it seems like this could be a good, sustainable approach to monitoring toxicity on the internet.


This post can be cited as:

@article{neeraj2020wsfortoxicity,
    title = "Data Labeling using Weak Supervision: In Action",
    author = "Neeraj, Trishala",
    journal = "trishalaneeraj.github.io",
    year = "2020",
    url = "https://trishalaneeraj.github.io/2020-07-26/data-labeling-weak-supervision"
}