Some words in a document refer to specific entities or real-world objects such as locations, people, and organizations. To find words that carry a unique context and are more informative, noun phrases in the text documents are considered. Named entity recognition (NER) is a technique for recognizing and separating such named entities and grouping them under predefined classes. In the era of the Internet, however, people often use slang rather than traditional or standard English, which standard natural language processing tools cannot handle well. Ritter (2011) [111] proposed a classifier for named entities in tweets because standard NLP tools did not perform well on them. The pragmatic level focuses on knowledge or content that comes from outside the content of the document.
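As an illustration, here is a minimal NER sketch using spaCy's pretrained English pipeline; the model name en_core_web_sm and the example sentence are assumptions for demonstration, not taken from the original text:

```python
# pip install spacy && python -m spacy download en_core_web_sm
import spacy

# Load a pretrained English pipeline that includes an NER component.
nlp = spacy.load("en_core_web_sm")

doc = nlp("Sundar Pichai announced that Google will open a new office in Mumbai.")

# Each recognized entity is grouped under a predefined class (PERSON, ORG, GPE, ...).
for ent in doc.ents:
    print(ent.text, ent.label_)
```

On informal text such as tweets, pretrained pipelines like this tend to degrade, which is exactly the gap Ritter (2011) addressed.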
This type of data is common in the health domain, and its major advantage is its capacity to separate cohort and temporal effects in analyses (Diggle et al. 2002). For example, longitudinal data arises in clinical studies that follow a group of patients with diabetes over five years to track changes in their blood sugar levels and complications. Longitudinal data contrasts with cross-sectional data, in which a single outcome is measured for each individual. Thus, longitudinal data analysis can generate important conclusions for health personnel from a temporal perspective. For example, the study of Zhao et al. (2019) relies on longitudinal Electronic Health Records (EHR) and genetic data to create a model for 10-year cardiovascular disease event prediction. Similarly, Severson et al. (2021) employed longitudinal data collected for up to seven years to develop a Parkinson's disease progression model covering intra-individual and inter-individual variability and medication effects.
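To make the longitudinal vs. cross-sectional distinction concrete, here is a small pandas sketch; the column names and measurement values are entirely hypothetical:

```python
import pandas as pd

# Longitudinal data: repeated measurements per patient over time (long format).
# A cross-sectional table would instead have a single row per patient.
longitudinal = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2, 2],
    "year":       [0, 1, 2, 0, 1, 2],
    "hba1c":      [7.1, 6.8, 6.5, 8.0, 7.9, 8.3],
})

# Per-patient change over time, a temporal effect that a single
# cross-sectional snapshot cannot reveal.
change = (
    longitudinal.sort_values("year")
    .groupby("patient_id")["hba1c"]
    .agg(lambda s: s.iloc[-1] - s.iloc[0])
)
print(change)
```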
Errors in text and speech
A broader concern is that training large models produces substantial greenhouse gas emissions. NLP is one of the fastest-growing research domains in AI, with applications that involve tasks such as translation, summarization, text generation, and sentiment analysis. Businesses use NLP to power a growing number of applications, both internal (detecting insurance fraud, determining customer sentiment, optimizing aircraft maintenance) and customer-facing, like Google Translate. Data augmentation is a data pre-processing strategy that automatically creates new training data without collecting it explicitly.
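The text does not name a specific augmentation method; as one common, lightweight example, here is a random-deletion sketch in the spirit of EDA (Wei and Zou, 2019). The function names and the deletion probability are illustrative assumptions:

```python
import random

def random_deletion(tokens, p=0.1, seed=None):
    """Drop each token with probability p (a simple EDA-style augmentation)."""
    rng = random.Random(seed)
    kept = [t for t in tokens if rng.random() > p]
    # Never return an empty sentence: keep at least one random token.
    return kept if kept else [rng.choice(tokens)]

def augment(sentence, n=3, p=0.1):
    """Create n new variants of a sentence without collecting new data."""
    tokens = sentence.split()
    return [" ".join(random_deletion(tokens, p, seed=i)) for i in range(n)]

print(augment("the flight was delayed by two hours", n=3, p=0.2))
```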
Finally, there is natural language generation (NLG), which helps machines respond by generating their own version of human language for two-way communication. Fan et al. [41] introduced a gradient-based neural architecture search algorithm that automatically finds architectures with better performance than the Transformer and conventional NMT models. They tested their model on WMT14 (English-to-German translation), IWSLT14 (German-to-English translation), and WMT18 (Finnish-to-English translation), achieving 30.1, 36.1, and 26.4 BLEU points respectively, outperforming the Transformer baselines.
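Since those results are reported in BLEU points, here is a minimal sketch of how such a score can be computed with the sacrebleu library; the hypothesis and reference sentences are made-up examples:

```python
# pip install sacrebleu
import sacrebleu

# One system output (hypothesis) per test sentence.
hypotheses = ["the cat sat on the mat", "he reads a book"]
# One or more reference streams, each parallel to the hypotheses.
references = [["the cat is on the mat", "he is reading a book"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")  # 0-100 scale, as in the WMT figures above
```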
Natural Language Processing: Challenges and Future Directions
A key question here, which we did not have time to discuss during the session, is whether we need better models or simply more training data. For building NLP systems, it is therefore important to include all of a word's possible meanings and all possible synonyms. Text analysis models may still occasionally make mistakes, but the more relevant training data they receive, the better they will be able to understand synonyms. These ambiguities are easy for humans to resolve because we read the context of the sentence and understand all of the different definitions.
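As a sketch of how a word's possible meanings and synonyms can be enumerated programmatically, WordNet via NLTK is one option; the word "book" is an arbitrary example, and the wordnet corpus must be downloaded first:

```python
# pip install nltk
import nltk
nltk.download("wordnet", quiet=True)  # one-time corpus download
from nltk.corpus import wordnet as wn

# List every sense of "book" with its definition and synonyms.
for synset in wn.synsets("book"):
    synonyms = sorted({lemma.name() for lemma in synset.lemmas()})
    print(synset.name(), "-", synset.definition())
    print("  synonyms:", synonyms)
```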