x-tagger is a Natural Language Processing library for token classification (POS tagging, etc.). The reason I call it “in its simplest form” is its high level of abstraction: you can train models and run inference in 5-10 lines of code. Another powerful feature of x-tagger is that it supports the most common dataset types. What does that mean? For example, torchtext is a common PyTorch library for Natural Language Processing, and huggingface transformers models are easy to train with huggingface datasets.

x-tagger packs all of these powerful features together. To train an x-tagger model, you need the simplest possible form of a POS tagging dataset. In this post I call it the “x-tagger dataset”, but it is nothing more than a list of lists of tuples:

[
[('It', 'PRON'), ('was', 'VERB'), ('outrageous', 'ADJ'), ('.', '.')],

[('``', '.'), ('Both', 'DET'), ('sides', 'NOUN'), ('are', 'VERB'), 
 ('taking', 'VERB'), ('action', 'NOUN'), ('.', '.'), ("''", '.')],
...
]
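
If your own data is already tokenized and tagged, this structure can be built with plain Python. A minimal sketch that reproduces the first sentence above:

words = ["It", "was", "outrageous", "."]
tags = ["PRON", "VERB", "ADJ", "."]

# zip each sentence with its tag sequence to get the x-tagger dataset format
dataset = [list(zip(words, tags))]
# [[('It', 'PRON'), ('was', 'VERB'), ('outrageous', 'ADJ'), ('.', '.')]]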

x-tagger is currently in beta release. It supports only Hidden Markov Models, Long Short-Term Memory networks, and BERT.

Tagging With Hidden Markov Model

A Hidden Markov Model (HMM) is a statistical Markov model built on the Markov assumption:

$$P(q_i = \beta \mid q_1 q_2 \ldots q_{i-1}) = P(q_i = \beta \mid q_{i-1})$$

A Hidden Markov Model for tagging allows us to talk about both observed words (events) and part-of-speech tags (hidden events) that we think of as causal factors in our probabilistic model. Formally, it is defined by:

$$
\begin{aligned}
& Q = q_1 q_2 q_3 \ldots q_n && \text{states} \\
& A = a_{11} \ldots a_{ij} \ldots a_{nn} && \text{transition probability matrix} \\
& O = o_1 o_2 \ldots o_T && \text{a sequence of } T \text{ observations} \\
& B = b_i(o_t) && \text{emission probabilities} \\
& \pi = \pi_1, \ldots, \pi_n && \text{initial distribution over states}
\end{aligned}
$$

For token classification, the transition probability is the probability of tag $t_i$ given the previous tag $t_{i-1}$:

$$p(t_i \mid t_{i-1}) = \frac{\text{count}(t_{i-1}, t_i)}{\text{count}(t_{i-1})}$$

and the emission probability is the probability of word $w_i$ given its tag $t_i$:

$$p(w_i \mid t_i) = \frac{\text{count}(t_i, w_i)}{\text{count}(t_i)}$$

The “training procedure” of an HMM consists of calculating the emission and transition probabilities, which are obtained from the tagged training dataset. The formulations above are based on the bigram Markov assumption; x-tagger has bigram and trigram options.
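
As a minimal illustration (plain Python, not x-tagger's internal code), the bigram transition and emission probabilities can be computed from an x-tagger dataset like this:

from collections import Counter

def bigram_probabilities(tagged_sents):
    # count tags, tag bigrams and (tag, word) pairs over the corpus
    tag_counts, bigram_counts, emission_counts = Counter(), Counter(), Counter()
    for sent in tagged_sents:
        tags = [tag for _, tag in sent]
        for word, tag in sent:
            tag_counts[tag] += 1
            emission_counts[(tag, word)] += 1
        for prev, curr in zip(tags, tags[1:]):
            bigram_counts[(prev, curr)] += 1

    # p(t_i | t_{i-1}) = count(t_{i-1}, t_i) / count(t_{i-1})
    transition = {(p, c): n / tag_counts[p] for (p, c), n in bigram_counts.items()}
    # p(w_i | t_i) = count(t_i, w_i) / count(t_i)
    emission = {(t, w): n / tag_counts[t] for (t, w), n in emission_counts.items()}
    return transition, emission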

In x-tagger, tagging unobserved samples and evaluation are done with Viterbi decoding, implemented via dynamic programming.

Decoding means choosing the tag sequence $t$ that is most probable given the observation sequence of $n$ words $w$:

$$\hat{t} = \underset{t}{\arg\max}\ p(t \mid w)$$

using Bayes’ rule, we have:

$$\hat{t} = \underset{t}{\arg\max}\ \frac{p(w \mid t)\, p(t)}{p(w)} = \underset{t}{\arg\max}\ p(w \mid t)\, p(t)$$

The probabilities are obtained from the bigram assumption mentioned above:

$$p(w \mid t) \approx \prod_{i=1}^{n} p(w_i \mid t_i) \qquad p(t) \approx \prod_{i=1}^{n} p(t_i \mid t_{i-1})$$

plugging in these equations, we have

$$\hat{t} = \underset{t}{\arg\max}\ p(w \mid t)\, p(t) \approx \underset{t}{\arg\max} \prod_{i=1}^{n} p(w_i \mid t_i)\, p(t_i \mid t_{i-1})$$

The Viterbi decoder can be implemented with dynamic programming for tagging unobserved samples or for evaluation.
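
A minimal sketch of bigram Viterbi decoding, assuming the transition and emission dictionaries from the sketch above plus an initial tag distribution pi (this is illustrative, not x-tagger's implementation; unseen events would need smoothing in practice):

import math

def logp(p):
    # log-probability with log(0) treated as -inf
    return math.log(p) if p > 0 else float("-inf")

def viterbi(words, tagset, pi, transition, emission):
    # dp[i][t]: best log-probability of any tag sequence ending in tag t at position i
    dp = [{t: logp(pi.get(t, 0)) + logp(emission.get((t, words[0]), 0)) for t in tagset}]
    backpointers = []
    for word in words[1:]:
        scores, pointers = {}, {}
        for t in tagset:
            prev = max(tagset, key=lambda q: dp[-1][q] + logp(transition.get((q, t), 0)))
            scores[t] = (dp[-1][prev]
                         + logp(transition.get((prev, t), 0))
                         + logp(emission.get((t, word), 0)))
            pointers[t] = prev
        dp.append(scores)
        backpointers.append(pointers)

    # backtrack from the best final tag
    best = max(tagset, key=lambda t: dp[-1][t])
    path = [best]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return list(zip(words, reversed(path)))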

x-tagger provides a high-level abstraction over the Hidden Markov Model, Viterbi decoding, and their extensions:

#pip install xtagger
import nltk
from sklearn.model_selection import train_test_split

from xtagger import HiddenMarkovModel

# requires nltk.download("treebank") and nltk.download("universal_tagset")
data = list(nltk.corpus.treebank.tagged_sents(tagset='universal'))
train_set, test_set = train_test_split(data, train_size=0.8, test_size=0.2)

hmm = HiddenMarkovModel(extend_to="bigram")
hmm.fit(train_set)
hmm.evaluate(test_set, random_size=10, seed=120)

#Accuracy: 90.41%

s = ["There", "are", "no", "two", "words", "in", "the", \
"English", "language", "more", "harmful", "than", "good", "job"]

hmm.predict(s)

and the output will be

[('There', 'DET'), ('are', 'VERB'), ('no', 'DET'),
 ('two', 'NUM'), ('words', 'NOUN'), ('in', 'ADP'),
 ('the', 'DET'), ('English', 'ADJ'), ('language', 'NOUN'),
 ('more', 'ADV'), ('harmful', 'ADV'), ('than', 'ADP'),
 ('good', 'ADJ'), ('job', 'NOUN')]

The extend_to parameter in initialization can take 3 values: bigram, trigram, and deleted_interpolation (see the appendix for deleted interpolation).
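
For example, switching to a trigram HMM only changes the constructor argument; training and evaluation calls stay the same (a quick sketch based on the API shown above):

hmm = HiddenMarkovModel(extend_to="trigram")
hmm.fit(train_set)
hmm.evaluate(test_set, random_size=5, seed=120)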

Tagging With LSTM

The second model in x-tagger is the LSTM (both unidirectional and bidirectional). For the bidirectional case, given a sequence of $N$ words $(w_1, w_2, \ldots, w_N)$, a forward part-of-speech tagger computes the probability of the sequence by modeling the probability of tag $t_k$ given the history $((w_1, t_1), (w_2, t_2), \ldots, (w_{k-1}, t_{k-1}))$:

$$p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid w_1, w_2, \ldots, w_{k-1}, t_1, t_2, \ldots, t_{k-1})$$

The backward part-of-speech tagger is similar to the forward one, running over the sequence in reverse:

$$p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid w_{k+1}, w_{k+2}, \ldots, w_N, t_{k+1}, t_{k+2}, \ldots, t_N)$$

And the formulation jointly maximizes the log-likelihood of the forward and backward directions:

$$\sum_{k=1}^{N} \Big( \log p(t_k \mid w_1, \ldots, w_{k-1}, t_1, \ldots, t_{k-1}; \Theta_x, \Theta_{LSTM}, \Theta_s) + \log p(t_k \mid w_{k+1}, \ldots, w_N, t_{k+1}, \ldots, t_N; \Theta_x, \Theta_{LSTM}, \Theta_s) \Big)$$

In x-tagger, the LSTM tagger is trained on torchtext iterators built from the x-tagger dataset:

import torch

from xtagger import LSTMForTagging
from xtagger import xtagger_dataset_to_df, df_to_torchtext_data

df_train = xtagger_dataset_to_df(train_set)
df_test = xtagger_dataset_to_df(test_set)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
train_iterator, valid_iterator, test_iterator, TEXT, TAGS = df_to_torchtext_data(df_train, df_test, device, batch_size=32)

#Number of training examples: 3131
#Number of testing examples: 783
#Unique tokens in TEXT vocabulary: 10133
#Unique tokens in TAGS vocabulary: 13

model = LSTMForTagging(TEXT, TAGS, cuda=True)
model.fit(train_iterator, test_iterator)

#Accuracy 95.93%

s = ["Oh", "my", "dear", "God", "are", "you", \
"one", "of", "those", "single", "tear," "people"]

model.predict(s)

and the output will be

[('oh', 'X'), ('my', 'PRON'), ('dear', 'ADJ'), 
('God', 'NOUN'), ('are', 'VERB'), ('you', 'PRON'), 
('one', 'NUM'), ('of', 'ADP'), ('those', 'DET'), 
('single', 'ADJ'), ('tear', 'NOUN'), ('people', 'NOUN')]

Tagging With BERT

BERT similarly maximizes a bidirectional likelihood function. x-tagger uses huggingface transformers to fine-tune BERT for part-of-speech tagging:

import nltk
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer

import torch
from xtagger import xtagger_dataset_to_df, df_to_hf_dataset
from xtagger import BERTForTagging

df_train = xtagger_dataset_to_df(train_set[:500], row_as_list=True)
df_test = xtagger_dataset_to_df(test_set[:100], row_as_list=True)

train_tagged_words = [tup for sent in train_set for tup in sent]
tags = {tag for word,tag in train_tagged_words}
tags = list(tags)

device = torch.device("cpu")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

dataset_train = df_to_hf_dataset(df_train, tags, tokenizer, device)
dataset_test = df_to_hf_dataset(df_test, tags, tokenizer, device)

model = BERTForTagging("bert-base-uncased", device, tags, tokenizer, log_step=100)

model.fit(dataset_train, dataset_test)

#Accuracy: 95.9592%

preds, _ = model.predict('the next Charlie Parker would never be discouraged.')
print(preds)

and the output will be

[('the', 'DET'), ('next', 'ADJ'), ('Charlie', 'NOUN'), 
('Parker', 'NOUN'), ('would', 'VERB'), ('never', 'ADV'), 
('be', 'VERB'), ('discouraged', 'VERB')]

Flexibility On Datasets

You can convert anything to the x-tagger dataset format and vice versa. You can play with nltk's tagged sentence corpora, which automatically return lists of tuples:

import nltk
conll2000 = nltk.corpus.conll2000.tagged_sents(tagset="universal")
indian = nltk.corpus.indian.tagged_sents()
sinica = nltk.corpus.sinica_treebank.tagged_sents()
conll2002 = nltk.corpus.conll2002.tagged_sents()
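
Any of these corpora can be split and passed straight to the taggers shown earlier; for example, a quick sketch reusing the HMM API from above:

from sklearn.model_selection import train_test_split
from xtagger import HiddenMarkovModel

# conll2000 is already a list of tuple lists, i.e. an x-tagger dataset
conll_data = list(conll2000)
conll_train, conll_test = train_test_split(conll_data, train_size=0.8, test_size=0.2)

hmm = HiddenMarkovModel(extend_to="bigram")
hmm.fit(conll_train)
hmm.evaluate(conll_test, random_size=10, seed=120)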

Discussion

  • The evaluation procedure of the HMM tagger is computationally expensive for trigram and deleted interpolation: it runs an $O(n^3)$ Viterbi decoder.
  • In practice, the current implementations work with all languages.
  • Upcoming features:
    • Bidirectional Hidden Markov Models.
    • A morphological approach to handling unknown words (language dependent).
    • Maximum Entropy Markov Models (MEMM).
    • Prior RegEx tagger for computational efficiency in HMMs (language dependent).
    • Beam search.

Appendix: Deleted Interpolation

Deleted interpolation was proposed in Jelinek and Mercer (1980) and is defined as:

$$p(t_i \mid t_{i-1}, t_{i-2}) = \lambda_1 \frac{C(t_{i-2}, t_{i-1}, t_i)}{C(t_{i-2}, t_{i-1})} + \lambda_2 \frac{C(t_{i-1}, t_i)}{C(t_{i-1})} + \lambda_3 \frac{C(t_i)}{N}$$

and the λ parameters can be obtained with the deleted interpolation algorithm:

#pip install xtagger
import nltk
from sklearn.model_selection import train_test_split

from xtagger import HiddenMarkovModel

data = list(nltk.corpus.treebank.tagged_sents(tagset='universal'))
train_set, test_set = train_test_split(data, train_size=0.8, test_size=0.2)

hmm = HiddenMarkovModel(extend_to="deleted_interpolation")
hmm.fit(train_set)
hmm.evaluate(test_set, random_size=5, seed=120)
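
For reference, here is a minimal sketch of how the λ weights themselves can be estimated from tag counts, an illustrative version of the deleted interpolation algorithm rather than x-tagger's internal code:

from collections import Counter

def deleted_interpolation(tagged_sents):
    # collect unigram, bigram and trigram tag counts
    uni, bi, tri = Counter(), Counter(), Counter()
    for sent in tagged_sents:
        tags = [tag for _, tag in sent]
        uni.update(tags)
        bi.update(zip(tags, tags[1:]))
        tri.update(zip(tags, tags[1:], tags[2:]))

    n = sum(uni.values())
    lambda1 = lambda2 = lambda3 = 0.0
    for (t2, t1, t0), count in tri.items():
        # "deleted" relative frequencies: subtract the current trigram from the counts
        c_tri = (count - 1) / (bi[(t2, t1)] - 1) if bi[(t2, t1)] > 1 else 0
        c_bi = (bi[(t1, t0)] - 1) / (uni[t1] - 1) if uni[t1] > 1 else 0
        c_uni = (uni[t0] - 1) / (n - 1)
        best = max(c_tri, c_bi, c_uni)
        if best == c_tri:
            lambda1 += count   # credit the trigram estimate
        elif best == c_bi:
            lambda2 += count   # credit the bigram estimate
        else:
            lambda3 += count   # credit the unigram estimate
    total = lambda1 + lambda2 + lambda3
    return lambda1 / total, lambda2 / total, lambda3 / total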
