Natural Language Processing: A Guide to NLP
What is Natural Language Processing?
Natural Language Processing, or NLP for short, is a field at the intersection of machine learning and linguistics that focuses on understanding and processing human language. It includes a set of classical problems whose solutions are of practical use.

Main NLP Use Cases
The following is a list of common NLP tasks:

  • Text classification is one of the most common NLP tasks. Given a set of texts, the goal is to assign each one to one or more categories. It has a variety of real-world applications, including determining the sentiment of a review, deciding whether an email is spam, checking whether a sentence is grammatically correct, and categorizing news articles.
  • Machine translation is one of the oldest problems in NLP. Over the last decade, the transformer architecture has brought huge progress in this area, but fully automatic high-quality machine translation (FAHQMT) remains an unsolved problem.
  • Named-entity recognition (NER): The task is to find the contiguous spans of text that mention entities and to label each with a type. Say we have a news article and want to highlight the entities in it, drawn from some predefined set of types such as persons, locations, organizations, and dates. An NER system should recognize that "January 1, 2020" is a date, "Elon Musk" is a person, and "Tesla" is an organization.
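  • Relation extraction: Identifying the semantic relationships between entities mentioned in a text, for example that "Elon Musk" is the CEO of "Tesla". It typically builds on earlier steps that identify the grammatical components of a sentence (verbs, adjectives, nouns) and its named entities (persons, locations, organizations).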
  • Chatbots simulate a conversation with a user in natural language through messaging applications, websites, mobile apps, or the telephone. Amazon's Alexa is one of the best-known examples.
  • Text summarization: A summarization system takes a long text as input and outputs a shorter text that conveys its main content. For example, the system may be asked to generate a retelling of the text, a title, or an abstract.
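
To make the first task in the list concrete, here is a minimal sketch of text classification as spam detection, using a multinomial naive Bayes model written with only the Python standard library. Naive Bayes is just one simple way to approach the task, chosen here for brevity. The class name, the helper function, and the four training sentences are all invented for this illustration, and a real system would need far more data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_counts = Counter()            # label -> number of documents
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.vocabulary = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Score = log P(label) + sum over words of log P(word | label)
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for this example.
train_texts = [
    "win a free prize now",
    "claim your free money today",
    "meeting rescheduled to monday morning",
    "please review the attached project report",
]
train_labels = ["spam", "spam", "ham", "ham"]

classifier = NaiveBayesClassifier().fit(train_texts, train_labels)
print(classifier.predict("free prize money"))           # spam
print(classifier.predict("project meeting on monday"))  # ham
```

The same fit/predict shape carries over to the other classification examples mentioned above (sentiment, news categories). Only the labels and the training texts change.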

NLP Tools and Libraries

  • NLTK — a base for any NLP project

    The Natural Language Toolkit, more commonly known as NLTK, is a suite of Python libraries and programs for symbolic and statistical natural language processing, aimed mainly at English. NLTK includes graphical demonstrations and sample data. Whether you're a researcher or an ML engineer, NLTK is likely the first tool you will encounter for experimenting with text analysis. Its bundled datasets are not large enough for deep learning, but it makes a great base for an NLP project that can then be augmented with other tools.

  • CoreNLP — language-agnostic and solid for all purposes

    A comprehensive NLP platform from Stanford, CoreNLP enables users to derive linguistic annotations for text, including token and sentence boundaries, parts of speech, named entities, numeric and time values, dependency and constituency parses, coreference, sentiment, quote attributions, and relations. It has pretrained models in 6 human languages.

  • Gensim — a library for word vectors

    Gensim is a free, open-source Python library for representing documents as semantic vectors as efficiently and painlessly as possible. It was originally created for unsupervised tasks such as document indexing, similarity retrieval, and topic modeling, but it is now mostly used for working with Word2Vec vectors. It was designed with practicality, memory efficiency, and performance in mind.

  • spaCy — business-ready with neural networks

    Often described as a more advanced alternative to NLTK, spaCy is designed for real-life production environments and works with deep learning frameworks such as TensorFlow and PyTorch. spaCy is opinionated: it doesn't give you a choice of which algorithm to use for each task, which makes it a poor option for teaching and research. Instead, it provides an end-to-end production pipeline and many business-oriented features.

  • TextBlob — beginner tool for fast prototyping

    TextBlob is a more intuitive, easier-to-use interface built on top of NLTK, which makes it more practical in real-life applications. One of its best-known features was language translation powered by Google Translate, although this has been removed from recent releases. It is also too slow for production use and lacks some handy features such as word vectors. Still, it is often recommended as the first choice for beginners and for prototyping.

    As the variety of tools shows, you choose one based on what fits your project best, even if the project is just learning and exploring text processing. They share one common feature: all of these tools have active discussion boards where most of your problems will be addressed and answered.

  • Transformers — State-of-the-art Natural Language Processing

    Transformers provides general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBERT, XLNet…) for Natural Language Understanding (NLU) and Natural Language Generation (NLG), with more than 32 pretrained models in over 100 languages and deep interoperability between JAX, PyTorch, and TensorFlow.
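
The Gensim entry above talks about representing documents as vectors and retrieving similar documents. As a rough, library-free sketch of that idea, the code below builds TF-IDF vectors by hand and compares them with cosine similarity. It deliberately does not use Gensim's API: the function names and the three sample sentences are assumptions made for this illustration, and the libraries listed above implement the same idea far more efficiently and with richer models.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (dict of word -> weight) per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each word appears in.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in term_freq.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Sample documents, invented for this example.
documents = [
    "the cat sat on the mat",
    "a cat lay on a rug",
    "stock markets fell sharply today",
]
vectors = tfidf_vectors(documents)

sim_cats = cosine_similarity(vectors[0], vectors[1])
sim_unrelated = cosine_similarity(vectors[0], vectors[2])
print(sim_cats > sim_unrelated)  # True: the two cat sentences share words
```

TF-IDF only matches documents that share exact words. Word vectors such as Word2Vec aim to go further by placing related words (for example "mat" and "rug") close together, which is the main reason to reach for a dedicated library.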

© All Rights Reserved. Columb Labs Inc.