Podcast Episode on:

Natural Language Processing for Documents

How about a quick intro into Natural Language Processing (NLP) for documents? Join this episode with our guest, Nina Hristozova, Data Scientist (AI & NLP) at Thomson Reuters. We'll learn more about NLP, its applications and challenges, and its current use with documents and other text-based formats.

George: Hello, everybody. Welcome to the Lights On Data Show. My name is George.

Diana: And I'm Diana.

George: And today we're going to talk about natural language processing.

Diana: That is amazing. And we have a wonderful guest. Her name is Nina Hristozova, and I hope I pronounced it well, but I think I did. A little bit about Nina:

She loves working on a wide range of projects, applying machine learning and deep learning to natural language processing business problems. She is currently working as an applied research scientist in NLP at Thomson Reuters in Switzerland. She is also a mentor and is excited about how her own experience can change the lives of others.

We are very excited about this topic and also for having Nina with us today. Welcome Nina.

Nina: Hi, everyone. And thank you, George and Diana, for inviting me. As I said, I'm very excited to be here, and a bit nervous, but yes.

Diana: Yes, you were mentioning that it's interesting to be on the other side. We're very excited to have you. All right, tell us a little bit about yourself, a fun fact about yourself.

Nina: A fun fact. Alright. A fun fact is that I am involved in a volleyball activity five days a week.

Diana: Oh my gosh. It's amazing.

Nina: Yeah. I play and coach, so it's one part of my life.

George: That's awesome. I like it. It has nothing to do with NLP. It's good to take your mind off it.

You know, we hear that very often, especially when you're in programming, when you're in data science: you've got to do some other activities to get your mind off of it. And that's when your best ideas come. That's when you start figuring out how a problem can be solved.

Nina: It definitely helps to cool down.

Diana: So getting a little bit into the field, how did you get to work in NLP? Why didn't you do, I don't know, music or something?

Nina: Well, NLP specifically was maybe a bit by chance or coincidence, I don't know what you call it. When I did my bachelor's in computer science in Scotland, at the University of Glasgow, and it came to my bachelor thesis at the end, I decided, okay, I don't want it to be in software engineering, I want it to be in research. I want to see what that is. Coincidentally, it ended up being around data science, and I really liked what I had to do for my bachelor thesis. So I decided, okay, let's see if I can find an internship in data science and see what it is.

And luckily, I found an opportunity at Thomson Reuters. So I actually started as a data science intern. This was supposed to be my gap year before going to do my master's, but I liked it so much that I just stayed. This is my story.

Diana: That's very nice. Tell us what NLP stands for and what it is. Because when I heard it, I'll be very honest, I thought it was neuro-linguistic programming, and it is not.

Nina: Yeah, the acronym is the same, I understand. It's natural language processing. The field is quite vast, but in general the idea is that you use computational techniques for the analysis and synthesis of natural language, and also speech. In my case, I work a lot with text data, so this is where my natural language comes from: documents, news articles, and so on.

George: And is the work that you're in charge of mostly written content, or is it also audio and video?

Nina: So currently I work mostly with written content. Yeah. It would be mostly documents.

George: Right. Which is a challenge of its own. Are there also other applications of NLP?

Nina: There are quite a few applications, I would say. For example, something very common is called named entity extraction, or named entity recognition. This is when you have text and you want to identify specific entities in that text.

There are out-of-the-box tools that already do this, for example if you want to extract monetary values, people's names, organizations, locations, and things like that. But you can also build your own custom NLP or named entity extraction systems.

When we make it custom, we don't just say, I want the names of people from this document; maybe we say, I want to see who the seller or the buyer is, who the plaintiff or the defendant is. And here you see the legal angle, because a lot of what Thomson Reuters does is legal, so we work with a lot of contracts. That is the more custom part. Then there are others, for example sentiment analysis of comments, tweets and so on, or clustering and classification. For example, you have different news articles and you want to classify whether they are about sports, celebrities, world news, and so on.

Then there's also text summarization, which is a very challenging and difficult problem. You have a big block of text and you want to extract the most important parts of it, or recreate them. And then you have question answering, chatbots, etc.
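To make named entity extraction concrete, here is a minimal sketch using spaCy, the open-source library Nina mentions later in the episode; the sample sentence and the small English pipeline are assumptions chosen purely for illustration.

```python
# Minimal sketch of out-of-the-box named entity extraction with spaCy.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

# Invented example sentence, just to show the kinds of entities detected.
text = ("Thomson Reuters agreed to pay $50 million to Acme Corp "
        "after negotiations led by Jane Smith in Zurich.")

doc = nlp(text)
for ent in doc.ents:
    # ent.label_ is a pre-trained category such as PERSON, ORG, MONEY or GPE.
    print(ent.text, "->", ent.label_)
```

Custom entities such as "seller" or "plaintiff" are not in the pre-trained label set, which is why the custom systems Nina describes have to be trained separately.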

George: Very interesting. So, for example, if we were to, you know, explore all these comments, and hello, Ravit, hello, Susan, hello, Rajendra, and everybody else that's joining us from all over the world right now.

So if we were to export all of these, and hi, LinkedIn user from San Francisco, I guess there would be a way through NLP algorithms to extract either the names or the locations they've mentioned, and maybe even the sentiment associated with the message.

Nina: Exactly. Yes.

Do you know, Nina, is NLP so advanced right now that it can detect sarcasm? I would assume that's an issue, right? Even we as humans find it hard sometimes to detect sarcasm.

Nina: Maybe if you somehow label it. Because at the end of the day, NLP is a technique that you would usually use in combination with AI systems, artificial intelligence systems. So if there's gold data, some kind of labels, a lot of training data that says, okay, this expression is sarcasm, then maybe yes.

George: Ravit was wondering "which industries do you think benefit the most from sentiment analysis?"

Nina: I would think maybe all industries, I mean, whenever you have reviews, let's say or some kind of discussion that you would like to monitor and things like that.

So it's really mostly when people write comments or express their opinions and things like that. Entertainment is a very common example of this.

George: I remember I read somewhere, I forget where or what stage of development it's at, but in the healthcare industry they're trying to use sentiment analysis to detect different patterns in speech and how a patient might be feeling, in order to detect if maybe they are on the verge of depression, or even detect certain diseases like Alzheimer's or schizophrenia. So I thought it was really fascinating that that's where we could be one day.

Nina: Yeah, definitely. There are also many startups that analyze speech. For example, call centers have audio files, and then they analyze the speech and the intonation of the voice as well. So you can also go in this direction and try to somehow measure sentiment.

Diana: Do you combine your research with psychology in any way?

Nina: No, not really at Thomson Reuters now.

Diana: Right, because you were saying that you're dealing mostly with legal documents, so probably in this case it wouldn't have as much to do with psychology as if it were, I don't know, marketing, or, as you said, entertainment.

Nina: We try to help our users as much as we can. For example, with text summarization, we have a use case where an internal editorial team summarizes legal cases. What we did is develop an AI solution to pre-fill the blank box where they were supposed to write a summary from scratch. Now, instead of having to start from nothing, they start from something and just review it. So we kind of augmented their workflow. Because with summaries, it's not like you always extract sentences from the text and just place them there; we try to actually produce more human-like summaries.

So sometimes the model might be generating its own words and things like that, so it's not that easy to map it directly to the text. But what we can get from the model is something called attention. So we can actually highlight in the contract, in the source document, where the model paid attention, and we display this to our users, which helps them when they are reviewing the draft. So we tried to bring in this explainability aspect.
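The Thomson Reuters models are internal, but a rough, hedged sketch of the "pre-fill the summary box" workflow with an open-source abstractive summarizer (the model choice and the case text are assumptions for illustration) might look like this:

```python
# Hedged sketch of pre-filling a draft summary for an editor to review.
# Uses an open-source model (facebook/bart-large-cnn) purely for illustration;
# the actual Thomson Reuters system is internal.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Invented case description, standing in for a real legal case document.
case_text = """The claimant entered into a five-year commercial lease with
the defendant in 2018. Following persistent water damage that the landlord
failed to repair despite repeated written notice, the claimant vacated the
premises and sought damages for breach of the repair covenant. The court
found the landlord liable and awarded damages plus costs."""

draft = summarizer(case_text, truncation=True)[0]["summary_text"]

# The draft is shown to the editor as a starting point, not a final summary.
print(draft)
```

The editor then reviews and corrects the draft rather than writing from a blank page, which is the workflow augmentation Nina describes; the attention-based highlighting she mentions is a separate explainability layer on top of this.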

George: That's fascinating. And do you know, does it all come out in one voice? What I mean by that is, sometimes when you're reading a document or a summary that somebody wrote, you're like, ah, that reads like Johnson wrote it, just because of the style.

Is it just one style, or can you match whoever wrote the article?

Nina: When we trained those models, the data would really be a mixture of editors. So it wouldn't be one universal voice that we trained the model with, but many, many of them. And it's very interesting what you're asking, because we very often show an example with the human-generated summary next to the machine-generated summary, and a lot of people cannot pick which one is which.

George: That's amazing. Were there a lot of iterations, and are there still iterations to improve that AI algorithm? Is there a feedback loop, where once the editor reviews the summary, it loops back to somehow improve the original algorithm?

Nina: Kind of. I think what you're talking about is a bit like human-in-the-loop systems, which are very fascinating for me personally; I love working with such systems, or trying to develop them. For this particular use case we don't have this functionality yet, or it's not there at the moment. But we have other use cases where we've tried something similar. And this is very interesting, because those human-in-the-loop systems also help you. There's always the challenge that those new, very complex and big models require big data, but human-in-the-loop systems are kind of a solution, a way to tackle the problem when you don't have big data, when you have small data. You can start from something less complex that doesn't require that much data. Of course, in this case the user might have to do a bit more reviewing and correcting, but after accumulating a number of samples, they can be fed back to the system to retrain it, and then you see again whether its quality increases, preferably.
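As a hedged illustration of the human-in-the-loop idea Nina describes, not her actual system, the loop of "model suggests, user corrects, corrected sample goes back into training" could be sketched with a simple text classifier; the labels, seed examples, and review function are all invented.

```python
# Hedged sketch of a human-in-the-loop loop for small data: the model proposes
# a label, the user confirms or corrects it, and the corrected sample is added
# to the training set before retraining. Classifier choice is illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["contract renewed for two years", "lease terminated early"]
labels = ["renewal", "termination"]          # tiny seed dataset

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

def review(text, suggestion):
    # Placeholder for the real UI where an editor accepts or fixes the label.
    corrected = input(f"'{text}' -> {suggestion}? (enter to accept / type fix) ")
    return corrected or suggestion

for incoming in ["tenant ended the lease after one month"]:
    suggestion = model.predict([incoming])[0]
    final_label = review(incoming, suggestion)
    texts.append(incoming)
    labels.append(final_label)
    model.fit(texts, labels)                 # retrain on the growing dataset
```

The point of the design is that the reviewing effort the user spends early on becomes training data, so quality can improve without a large dataset up front.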

George: Thanks for that example, wonderful. Rajendra is wondering: which languages do you process? Is it just English, or also German, Japanese, Spanish, Russian? Are there other languages that are part of the process?

Nina: We mostly work with English data, but for NLP, definitely you can process whichever languages you like.

And of course, data is a big thing here: for some languages there's more data available, for others not so much. Machine translation is one example, where you translate between different languages, and that is definitely part of NLP.

George: Very interesting. I want to bring in this comment from Myriam as well. She mentions that the question about working with psychologists is very interesting: in family law, there may be professionals that need to gather information for a case, which involves reports from psychologists and other parties. Being able to provide the value of text summarization in that space could certainly help other use cases involving these professionals. Absolutely, Myriam, thanks for sharing this with us. It's fascinating.

Nina: She's one of my colleagues who works in our research labs.

Diana: Nice to have you.

George: So I'm wondering, Nina, are there different techniques, different algorithms, that are used for NLP depending on the application?

Nina: Depending on the application, the differences in algorithms or models are becoming less and less with time, with research, and as the field develops, but I would definitely say there are different ways to handle things. Let's use the same examples as so far: we have named entity recognition or extraction, and we also have text summarization. Text summarization is what we call a sequence-to-sequence task, because we have a long sequence as input and the model has to generate a relatively long sequence as output, which is multiple sentences.

Named entity recognition or extraction is more like, let's say you have a question, you could put it that way: who is the landlord? And then you extract from the text the name of that organization or person, and so on.

You can also treat this as a classification task. You represent the text of the document you want to extract from as words, or tokens as we call them, and then you label them and say, okay, this word is an entity, this one isn't, and so on.

And this is also the kind of training data you're going to use in the end. So it's a bit different in terms of how the output is represented.
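A minimal sketch of what that token-level training data can look like, using a BIO labelling scheme; the LANDLORD label and the sentence are invented for illustration.

```python
# Hedged sketch of representing NER training data as per-token labels
# (BIO scheme). "LANDLORD" and the sentence are invented examples of the
# kind of custom label Nina describes.
tokens = ["The", "premises", "are", "let", "by", "Acme", "Properties", "Ltd", "."]
labels = ["O",   "O",        "O",   "O",   "O",  "B-LANDLORD", "I-LANDLORD", "I-LANDLORD", "O"]

# Each token gets a class: O = not an entity, B-/I- mark the beginning and
# inside of an entity span. A model is trained to predict one label per token,
# which is why this can be treated as a classification task.
for tok, lab in zip(tokens, labels):
    print(f"{tok:12s}{lab}")
```

Token-classification models, whether spaCy pipelines or transformer-based ones, consume exactly this kind of per-token labelling.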

George: Very interesting. What challenges are there still within NLP?

Nina: I think one challenge is definitely the long input documents, or long documents, problem.

And often the question is, okay, what do we really do with those? It's an open question in terms of research and overall. For example, do we try to fit all of it into a model? Because that is very memory heavy. Or do we develop techniques where we try to figure out exactly which parts of the text need to be input to the model? This handling of long documents is still being explored, I guess, and there's a lot of work on that. And then the other challenging thing, I would say, is human-in-the-loop systems. Because here you have the user, and how a user interacts with the system is very subjective. It could be very subjective, because everyone is different and has different needs and so on. So developing one universal human-in-the-loop system might not work.

George: Right. Ambrotha is wondering if you can please share some key challenges that you're facing while handling data for NLP use cases.

Nina: When you're working with documents it's also very challenging, because sometimes those documents come from scans, PDFs, or images and so on. So you have tools called OCR that preprocess this image, scan, or PDF.

And they output the raw text that I work with. Sometimes there are mistakes in this OCR; it isn't perfect. There is noise or, for example, misalignment of the different columns where the information is, and so on. So this, I would say, is quite often very challenging.
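A hedged sketch of that OCR preprocessing step using common open-source tools (pytesseract and pdf2image are assumptions, as is the input file name), not the pipeline Thomson Reuters actually uses:

```python
# Hedged sketch of the OCR step Nina describes: turning a scanned document
# into raw text before any NLP. Assumes the Tesseract binary, poppler, and
# the pytesseract and pdf2image packages are installed; "contract.pdf" is a
# hypothetical input file.
from pdf2image import convert_from_path
import pytesseract

pages = convert_from_path("contract.pdf", dpi=300)   # render each page as an image

raw_text = []
for page in pages:
    raw_text.append(pytesseract.image_to_string(page))

# The output often contains OCR noise: misread characters, broken lines, or
# columns merged in the wrong order, which downstream NLP has to tolerate.
print(raw_text[0][:500])
```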

George: So it still requires that human input to go in and correct those issues.

Nina: Ideally, but in reality we try to work around it: either by improving the extraction from the PDF, or, for example, some of my colleagues, because of this problem, actually work on using a summarization technique for named entity extraction. That was how they tried to work around it.

George: Right. You know, I would imagine that, for example, in the German language, because all the nouns are written with a capital letter, it's so much harder to detect which one is a proper noun, somebody's name, than in English. Anyway, I was just thinking out loud.

Nina: There are actually pretty cool out-of-the-box tools that do this, so thankfully we don't have to start from scratch. One library that I use very often is called spaCy, and in there you can actually find support for this. And I think they might also have something for languages other than English. I'm not super sure, but they might.

George: So there are some openly available libraries that people could use.

Nina: Oh, yeah, definitely. For me, at least in my world, very often I apply what is already there. For things like this, I wouldn't start writing something from scratch; I would first see if something open source exists and then use it.

Diana: What was the biggest step forward or the biggest progress that you saw since you started working with this? Or maybe your biggest personal success? Whatever you would like to share.

Nina: I can share both, there's time. In natural language processing there's something called feature engineering. The thing is that with natural language processing you have to figure out a way to turn text into numerical data so that a machine, or an AI model, can use it. If you want to train a machine learning model, very often there was this step where you, as the data scientist, are the one engineering those features. For example, what you guys mentioned: is this a pronoun, is this a verb, and so on. I would have to think about these features myself, really engineer them, and feed them to the models.

Now, in the age of deep learning and with all the computational resources that are easily available, that step is kind of taken away from you. You don't have to do it anymore; the models do it themselves. And there is something called "word embeddings", which is actually a step further: they try to capture the semantic meaning of things. For example, we could have two sentences, let's say "an apple a day keeps the doctor away" and, I don't know, "Apple released the new iPhone", or something like that. With those word embeddings you can actually understand that the context is different and that "apple" is not the same word: in one sense it's an organization and in the other it's a fruit.
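A hedged sketch of the "apple" example using contextual embeddings from an open-source BERT model (an illustrative choice, not the models Nina's team uses): the same surface word gets a different vector depending on its context, so the fruit sense and the company sense can be told apart.

```python
# Hedged sketch: contextual embeddings give the same word different vectors
# in different contexts. Model choice is purely illustrative.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual embedding of the first occurrence of `word`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]            # (tokens, 768)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

v_fruit   = word_vector("an apple a day keeps the doctor away", "apple")
v_company = word_vector("apple released the new iphone", "apple")
v_fruit2  = word_vector("she ate an apple with her lunch", "apple")

cos = torch.nn.functional.cosine_similarity
print("fruit vs company:", cos(v_fruit, v_company, dim=0).item())
print("fruit vs fruit  :", cos(v_fruit, v_fruit2, dim=0).item())
# The fruit-fruit similarity is typically higher than fruit-company,
# showing that context is captured in the vector space.
```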

George: That's so interesting. That's such a good point. Context is really everything here.

Nina: Absolutely.

Diana: When you translate, you translate and adapt, right?

George: Right. I know in Mandarin, for example, the same word could have so many different meanings.

And not just Mandarin, but in Mandarin in particular, somebody was giving us an example that one word can mean seven different things, and depending on the sentence and the context of usage it can really mean different things.

Nina: Yeah, definitely. This is a very big breakthrough, I would say, in NLP that you can actually capture the context and somehow represent this in a vector space.

George: Do you also have libraries, big databases of proper nouns or other pieces, where you can look up company names, so that you know something is likely to be a company because you can find the name in that database?

Nina: This library that I mentioned, spaCy, also has its own named entity recognizers. This is one out-of-the-box tool that you can use very easily for benchmarks. Let's say you want to extract, I don't know, organizations, people's names and so on; then you can always start with these, and they're usually pretty good at it. So I guess they have a database of all those names. If you want to make it custom, then you have to train your own model. And this is where the very fun part comes in.
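A hedged sketch of training such a custom entity recognizer with spaCy; the LANDLORD label, the two training sentences, and the tiny number of examples are purely illustrative, and a real system would need far more annotated data.

```python
# Hedged sketch of training a custom spaCy entity recognizer for an invented
# "LANDLORD" label. Real training requires many more annotated examples.
import random
import spacy
from spacy.training import Example

TRAIN_DATA = [
    ("The landlord, Acme Properties Ltd, shall maintain the premises.",
     {"entities": [(14, 33, "LANDLORD")]}),
    ("Rent is payable monthly to Jane Smith as landlord.",
     {"entities": [(27, 37, "LANDLORD")]}),
]

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("LANDLORD")

optimizer = nlp.initialize()
for epoch in range(20):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer, losses=losses)

# Try the freshly trained model on a new sentence; with so little data the
# result will be unreliable, which is exactly why annotation matters.
doc = nlp("The landlord, Brown Estates LLC, is responsible for repairs.")
print([(ent.text, ent.label_) for ent in doc.ents])
```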

Diana: And this is why Nina exists in this world.

George: Rajendra was wondering if you can talk about the size of the original legal document versus the size of the summary, maybe in number of words. Is there always a direct relationship between the two?

Nina: Very interesting question. It really depends on your training data. Ideally we would ask the model to generate a summary that is not much bigger than what you're training it with. For example, in the particular use case that I mentioned, the summaries were about two to three sentences.

So when we were training the model to generate summaries, we would specify that we want that kind of length. And in terms of the input, the legal document could really vary from one to 100 pages or even more. It really depends on the contract.

Sometimes you might get lucky, because we have subject matter experts internally and they would say, "okay, for this use case, when we write the summaries we usually find the most important things in the first 15 pages". So you don't always have to read the whole contract. But sometimes they say, "well, it might really be anywhere", and then you have to try to tackle that challenge.
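A hedged sketch of that heuristic, reusing the same open-source summarizer as in the earlier sketch: keep only the leading pages of the contract and constrain the output length to roughly the two-to-three-sentence summaries the model was trained on. The page size, file name, and length values are all assumptions.

```python
# Hedged sketch of the "most important things are in the first 15 pages"
# heuristic combined with length control. All constants are illustrative.
from transformers import pipeline

PAGES_TO_KEEP = 15
CHARS_PER_PAGE = 3000                      # rough approximation of a page

def head_of_contract(full_text: str) -> str:
    # Keep only the leading pages instead of feeding the whole contract.
    return full_text[: PAGES_TO_KEEP * CHARS_PER_PAGE]

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
contract_text = open("contract.txt").read()        # hypothetical input file

summary = summarizer(
    head_of_contract(contract_text),
    max_length=90,      # keep the output around two to three sentences,
    min_length=40,      # matching the length of the training summaries
    truncation=True,
)[0]["summary_text"]
print(summary)
```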

Diana: We have two more minutes and I have two more questions. So the first question is what is the future of NLP? Where are we heading?

Nina: That's a tough question. I don't know, to be honest. There are many directions we're trying to go in. One for sure is around explainability, something I mentioned before: how can we show where the model paid more attention when it was generating a certain output? This is definitely one direction I think we'll hear more and more about, because it also helps the users, the end users.

Then there are other emerging technologies, let's say quantum computing or blockchain and so on. So maybe some kind of synergy between NLP and those technologies is something I think we might be seeing a lot more of in the future.

George: Fascinating. And second question?

Diana: And that would also be my last question, how can people get in contact with Nina and hear more from her?

George: I definitely recommend following you on LinkedIn. I'm just posting your link here, and I'll also post it in the comments of the podcast and of this video. So I do encourage people to follow you on LinkedIn, follow your content, and get in touch with you.

Diana: And is there anywhere else, or anything else, that you would like to share from what you're doing, where people can follow you?

Nina: Thank you. Yeah, as George and Diana said, you can find me on LinkedIn. If you have any follow-up questions, I'm more than happy to answer. Also, I'm a co-organizer of a meetup on NLP, natural language processing, called NLP Zurich. I can share those details as well.

We also have a LinkedIn presence. We usually organize meetups twice a month, or we try to, and they're on different topics in NLP or data science, so it's not only NLP-specific for now. And yeah, it was a pleasure.

George: Thank you so much. And thank you so much, everybody for all your questions and sorry that we didn't get a chance to get to all of them.

Diana: Thank you very much. It's always a pleasure to have you on the podcast and thank you, Nina. It was lovely meeting you and thank you for sharing all these lovely insights with us.

Nina: Lovely meeting you, too.

Diana: Have a good weekend. Bye. bye. Bye.


Tags

artificial intelligence, NLP



{"email":"Email address invalid","url":"Website address invalid","required":"Required field missing"}
>