Multi-Label classification using AllenNLP

6 min read · May 10, 2021

Ever since I started using AllenNLP, there has been no turning back. It is not only easy to code with but also makes it simple to set up many variations of a deep learning experiment. In this article, we will work through a multi-label classification problem using AllenNLP. You may be surprised that ideas from several research papers can be implemented with just a few lines of code. So, without further ado, let’s get started.

Problem Statement

Given a sentence or group of statements, the goal is to classify them into one or more toxicity labels.

The problem originates from the multitude of online forums in which people actively participate and comment. Because comments may sometimes be abusive, insulting, or even hate-based, it becomes the responsibility of the hosting organizations to keep these conversations from turning negative. The task is thus to build a model that classifies comments into categories such as toxic, severe toxic, obscene, threat, insult, or identity hate.

Dataset

We are provided with a Kaggle dataset containing a large number of Wikipedia comments that have been labeled by human raters for toxic behavior. The types of toxicity are:

toxic
severe toxic
obscene
threat
insult
identity hate

We are going to create a model which predicts the probability of each type of toxicity for each comment.
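
To make "the probability of each type of toxicity" concrete, here is a minimal sketch, assuming the model produces one raw score (logit) per label and that each score is squashed independently with a sigmoid. The scores below are made-up numbers, and the underscored label names follow the variable names used in the reader code later in this article.

```python
import math

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def sigmoid(x: float) -> float:
    """Squash a raw score into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-x))


def label_probabilities(logits):
    """Map one raw score per label to an independent probability."""
    return {label: sigmoid(z) for label, z in zip(LABELS, logits)}


# Hypothetical raw scores for a single comment (illustrative only).
example_logits = [2.0, -1.0, 1.5, -3.0, 0.5, -2.0]
probs = label_probabilities(example_logits)

for label, p in probs.items():
    print(f"{label:>14}: {p:.3f}")

# Unlike a softmax, the six probabilities are not forced to sum to 1,
# so a comment can be both "toxic" and "insult" at the same time.
print("sum of probabilities:", round(sum(probs.values()), 3))
```

This independence between labels is what separates multi-label from multi-class classification.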

Getting started

The only dependency for this tutorial is going to be AllenNLP, which can be installed with pip. First, make sure that you have a clean Python 3.6 or 3.7 virtual environment, and then install AllenNLP with pip. For example:

# Assuming conda is installed
conda create -n bert_snli python=3.7 -y
source activate bert_snli
pip install allennlp

Exploratory data analysis

A quick exploratory analysis shows that we have more toxic examples than examples of any other category, and that threat is the least represented. Does this indicate a class imbalance problem too? Maybe. In this article, we are not going to handle class imbalance.

One quick thing to observe is that Severe Toxic has fewer examples than Toxic. For better scores, we are going to assign Severe Toxic examples to Toxic too.
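
The relabeling step can be sketched as follows. This is only an illustration: the column names `toxic`, `severe_toxic`, and `comment_text` are assumed to match the Kaggle CSV header, and the rows are placeholders rather than real data.

```python
def merge_severe_into_toxic(row: dict) -> dict:
    """Return a copy of the row where every severe-toxic example is also toxic."""
    merged = dict(row)  # copy, so the original row is left untouched
    if int(merged["severe_toxic"]) == 1:
        merged["toxic"] = 1
    return merged


# Placeholder rows standing in for parsed CSV records.
rows = [
    {"comment_text": "placeholder comment A", "toxic": 0, "severe_toxic": 1},
    {"comment_text": "placeholder comment B", "toxic": 0, "severe_toxic": 0},
]
merged_rows = [merge_severe_into_toxic(r) for r in rows]

for row in merged_rows:
    print(row["comment_text"], "->", "toxic" if row["toxic"] else "not toxic")
```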

Approach

We’ll be using AllenNLP to build this model. AllenNLP divides the whole process into the following components:

TokenEmbedder
Seq2SeqEncoder
Seq2VecEncoder

TokenEmbedder

TokenEmbedder is responsible for encoding English words into word vectors. Popular options here are GloVe, fastText, Word2Vec, ELMo, and BERT. AllenNLP provides an easy-to-use interface for any of these techniques.

Seq2SeqEncoder

Seq2SeqEncoder is responsible for encoding context from the word vectors. The simplest examples are LSTM and GRU networks, which have a built-in gating mechanism to forget, remember, and store history. One caveat of a plain LSTM or GRU is that it reads the sequence in only one direction at a time (either forward or backward). The transformer architecture is often a better way to encode one sequence into another. With AllenNLP there is no need to worry: the Seq2Seq architecture is just a parameter you pass in the configuration, so trying alternatives is cheap.

Seq2VecEncoder

This is the most important layer: it summarizes the embeddings of all the words and converts them into a single fixed-size vector.
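
The three stages can be sketched in plain Python to show how the shapes change. This is a toy illustration and not AllenNLP code: the two-dimensional vectors are invented, and the neighbour-averaging step is only a stand-in for a real Seq2Seq encoder such as an LSTM or a transformer.

```python
from typing import Dict, List

Vector = List[float]

# Toy 2-dimensional "pretrained" vectors; real embeddings have hundreds of dimensions.
TOY_VECTORS: Dict[str, Vector] = {
    "you": [0.1, 0.3],
    "are": [0.0, 0.2],
    "rude": [0.9, -0.4],
}
UNKNOWN: Vector = [0.0, 0.0]


def embed(tokens: List[str]) -> List[Vector]:
    """TokenEmbedder stage: look up one vector per token."""
    return [TOY_VECTORS.get(token, UNKNOWN) for token in tokens]


def contextualize(vectors: List[Vector]) -> List[Vector]:
    """Seq2Seq stage (toy stand-in): average each vector with its neighbours.

    The output sequence has the same length as the input sequence.
    """
    out = []
    for i in range(len(vectors)):
        window = vectors[max(0, i - 1): i + 2]
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out


def pool(vectors: List[Vector]) -> Vector:
    """Seq2Vec stage: mean-pool the whole sequence into one fixed-size vector."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


tokens = ["you", "are", "rude"]
embedded = embed(tokens)                 # 3 tokens -> 3 vectors
contextual = contextualize(embedded)     # 3 vectors -> 3 context-aware vectors
sentence_vector = pool(contextual)       # 3 vectors -> 1 vector of size 2

print(len(embedded), len(contextual), len(sentence_vector))
```

The point to notice is that the Seq2Seq stage keeps one vector per token, while the Seq2Vec stage collapses the sequence into a single vector whose size does not depend on the sentence length.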

Code

Due to space constraints, we will look at a single approach, the one that has given me the best results so far on this dataset. We’re going to build this model step by step.

Dataset Reader

from typing import List, Optional

from allennlp.data import Instance
from allennlp.data.dataset_readers import DatasetReader
from allennlp.data.fields import LabelField, ListField, TextField


class ToxicReader(DatasetReader):
    ''' Reader for toxic classification dataset '''
    def __init__(self) -> None:
        # Elided here; it sets _tokenizer, _token_indexer,
        # _max_sequence_length and fill_in_empty_labels.
        ...

    def text_to_instance(self,
                         text: str,
                         labels: Optional[List[str]] = None) -> Instance:
        # first clean text (clean_text is a helper defined in the full code)
        text = clean_text(text)

        # truncate the raw text to at most _max_sequence_length characters
        if self._max_sequence_length is not None:
            text = text[:self._max_sequence_length]

        tokenized_text = self._tokenizer.tokenize(text)
        text_field = TextField(tokenized_text, self._token_indexer)
        fields = {'text': text_field}
        if labels or self.fill_in_empty_labels:
            labels = labels or [0, 0, 0, 0, 0, 0]

            toxic, severe_toxic, obscene, threat, insult, identity_hate = labels
            fields['labels'] = ListField([
                LabelField(int(toxic), skip_indexing=True),
                LabelField(int(severe_toxic), skip_indexing=True),
                LabelField(int(obscene), skip_indexing=True),
                LabelField(int(threat), skip_indexing=True),
                LabelField(int(insult), skip_indexing=True),
                LabelField(int(identity_hate), skip_indexing=True)
            ])

        # return the instance even when no labels are attached
        return Instance(fields)

Full code is available here. The reader parses the data files, cleans unnecessary characters out of each sentence, truncates it to a maximum length (an argument to the class), and returns an Instance that contains both the input and, optionally, the target values.

You can read the whole dataset by specifying the path to dataset files as the argument for the read() method as in:

reader = ToxicReader()
train_ds = reader.read('./train.csv')
valid_ds = reader.read('./valid.csv')
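
The file-parsing part of the reader is not shown above, so here is a rough sketch of what reading one CSV file into (text, labels) pairs might look like, using only the standard library. The column names are assumed to follow the Kaggle CSV header, `iter_rows` is a hypothetical helper rather than part of AllenNLP or of the full code, and the sample rows are placeholders.

```python
import csv
import io
from typing import Iterator, List, Optional, Tuple

# Assumed column names, in the same order as the labels in text_to_instance.
LABEL_COLUMNS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def iter_rows(handle, max_sequence_length: Optional[int] = None
              ) -> Iterator[Tuple[str, List[int]]]:
    """Yield (text, labels) pairs from an open CSV file handle."""
    for row in csv.DictReader(handle):
        text = row["comment_text"]
        if max_sequence_length is not None:
            text = text[:max_sequence_length]  # character-level truncation
        labels = [int(row[column]) for column in LABEL_COLUMNS]
        yield text, labels


# In-memory stand-in for train.csv with placeholder comments.
sample_csv = io.StringIO(
    "id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate\n"
    "1,placeholder comment one,1,0,0,0,1,0\n"
    '2,"placeholder, with a comma",0,0,0,0,0,0\n'
)

parsed = list(iter_rows(sample_csv, max_sequence_length=20))
for text, labels in parsed:
    print(repr(text), labels)
```

In the real reader, each pair would then be passed to text_to_instance to build an Instance.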

Modelling

To define the model, we’ll be using a Jsonnet configuration. AllenNLP provides an easy-to-use interface to declare variables and architectures in JSON-style files.

"model": {
    "type": "toxic",
    "text_field_embedder": {
        "token_embedders": {
            "tokens1": {
                "type": "embedding",
                "pretrained_file": "input/fatsttext-common-crawl/crawl-300d-2M/crawl-300d-2M.vec",
                "embedding_dim": fasttext_embedding_dim,
                "trainable": false
            },
            "tokens2": {
                "type": "embedding",
                "pretrained_file": "input/glove-stanford/glove.twitter.27B.200d.txt",
                "embedding_dim": glove_embedding_dim,
                "trainable": false
            }
        },
    },
    "encoder": {
        "type": "swem",
        "embedding_dim": fasttext_embedding_dim + glove_embedding_dim
    },
    "classifier_feedforward": {
        "input_dim": lstm_hidden_size*2,
        "num_layers": 2,
        "hidden_dims": [200, 6],
        "activations": ["tanh", "linear"],
        "dropout": [0.2, 0.0],
    }
},

The model is pretty straightforward. It uses a text field embedder, an encoder, a classifier feedforward network, and MultiLabelMarginLoss.

Let’s go through it one by one.

Text Field Embedder encodes text tokens into embeddings. In this model, we want to use GloVe and fastText as the pre-trained embeddings for this layer. Luckily for us, we don’t need to deal with any mappings or extra configuration; AllenNLP provides an easy-to-use interface for us to leverage.
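
When two token embedders are configured, their outputs are joined token by token, which is why the encoder in the configuration expects fasttext_embedding_dim + glove_embedding_dim. The sketch below mimics that concatenation in plain Python. The dimensions 300 and 200 are read off the pretrained file names, `concat_embeddings` is a hypothetical helper, and the constant vectors are placeholders.

```python
from typing import Dict, List

FASTTEXT_DIM = 300  # taken from the file name crawl-300d-2M.vec
GLOVE_DIM = 200     # taken from the file name glove.twitter.27B.200d.txt


def concat_embeddings(per_embedder: Dict[str, List[List[float]]]) -> List[List[float]]:
    """Concatenate, token by token, the vectors produced by each embedder.

    Embedders are visited in sorted key order so the layout is deterministic.
    """
    keys = sorted(per_embedder)
    num_tokens = len(per_embedder[keys[0]])
    return [
        [value for key in keys for value in per_embedder[key][i]]
        for i in range(num_tokens)
    ]


num_tokens = 4
outputs = {
    "tokens1": [[0.0] * FASTTEXT_DIM for _ in range(num_tokens)],  # fastText stand-in
    "tokens2": [[1.0] * GLOVE_DIM for _ in range(num_tokens)],     # GloVe stand-in
}
combined = concat_embeddings(outputs)

# Each token now carries a single vector of size 300 + 200.
print(len(combined), len(combined[0]))
```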

The encoder is an example of a Seq2Vec encoder: it converts a sequence of word embeddings into a single sentence embedding that captures the context. Here we are using a custom encoder called SWEM. Read more about it here. It concatenates the max-pooled and mean-pooled word embeddings.

The classifier feedforward network decodes the sentence vector into the target labels. Here, we use a multi-layer feed-forward network with dropout.

Believe it or not, that’s it. This is all we need to code. Now, let’s move to the interesting part: training this beast.
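
As a rough illustration of the pooling idea, and not the custom encoder from the full code, the sketch below max-pools and mean-pools a sequence of word vectors over the token axis and concatenates the two results. The ordering (max first, then mean) is an assumption, and padding masks are ignored for brevity.

```python
from typing import List


def swem_concat(vectors: List[List[float]]) -> List[float]:
    """Concatenate max-pooled and mean-pooled features over the token axis."""
    columns = list(zip(*vectors))                 # one tuple per embedding dimension
    max_pooled = [max(col) for col in columns]
    mean_pooled = [sum(col) / len(col) for col in columns]
    return max_pooled + mean_pooled


# Three tokens with 2-dimensional toy embeddings.
word_vectors = [
    [1.0, -2.0],
    [3.0, 0.0],
    [2.0, 4.0],
]
sentence = swem_concat(word_vectors)

# The output size is twice the embedding size, whatever the sentence length.
print(len(sentence), sentence)
```

Because the pooled vector is twice as wide as the word embeddings, the input size of the layer that follows has to account for the doubling.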

Training

Training is simple here. We are going to use the AllenNLP train command to train this model.

allennlp train config/toxic_swem.jsonnet -s output -f --include-package nlp

This will start training the model and print metrics at each epoch. We can follow the loss either in the console logs or through the TensorBoard integration.

Predict

What fun is there if we can’t use the model to predict labels for new sentences? Here we implement our own predictor class for toxic classification.

from typing import Iterable

import numpy as np
import torch
from tqdm import tqdm

# Written against the pre-1.0 AllenNLP API, which still has data iterators.
from allennlp.data import Instance
from allennlp.data.iterators import DataIterator
from allennlp.models import Model
from allennlp.nn import util as nn_util


class ToxicCommentPredictor:
    def __init__(self, model: Model, iterator: DataIterator,
                 cuda_device: int = -1) -> None:
        self.model = model
        self.iterator = iterator
        self.cuda_device = cuda_device

    def _extract_data(self, batch) -> np.ndarray:
        # run the model on one batch and return its probabilities as a NumPy array
        out_dict = self.model(**batch)
        return out_dict["probs"].detach().cpu().numpy()

    def predict(self, ds: Iterable[Instance]) -> np.ndarray:
        # a single, unshuffled pass so predictions line up with the input order
        pred_generator = self.iterator(ds, num_epochs=1, shuffle=False)
        self.model.eval()
        pred_generator_tqdm = tqdm(pred_generator,
                                   total=self.iterator.get_num_batches(ds))
        preds = []
        with torch.no_grad():
            for batch in pred_generator_tqdm:
                batch = nn_util.move_to_device(batch, self.cuda_device)
                preds.append(self._extract_data(batch))
        return np.concatenate(preds, axis=0)

And that’s all there is to it. We now have a predictor that can test our newly trained model. To predict labels for a new sentence, we can use the following statement:

allennlp predict -o {'sentence': 'I hate you !'} -s output -f --include-package nlp
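
Since the model outputs one probability per label, a final step is usually needed to turn those numbers into label names. Here is a minimal sketch, assuming a cutoff of 0.5 applied to every label; the cutoff, the helper name, and the probabilities are illustrative, not values produced by the trained model.

```python
from typing import List, Sequence

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def probabilities_to_labels(probs: Sequence[float], threshold: float = 0.5) -> List[str]:
    """Return every label whose probability reaches the threshold."""
    return [label for label, p in zip(LABELS, probs) if p >= threshold]


# Made-up probabilities for one comment.
example_probs = [0.91, 0.12, 0.67, 0.03, 0.55, 0.08]
predicted = probabilities_to_labels(example_probs)
print(predicted)

# A comment with no probability above the threshold gets no labels at all.
print(probabilities_to_labels([0.1, 0.0, 0.2, 0.0, 0.3, 0.0]))
```

In practice the threshold could be tuned per label on the validation set, which matters most for rare labels such as threat.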

That’s it! I hope you learned something new about AllenNLP through this blog. Don’t forget to give a thumbs up if you liked this post. The whole code is available here.