Newest 'text-mining' Questions - Stack Overflow

Questions tagged [text-mining]

Text Mining is a process of deriving high-quality information from unstructured (textual) information.

0
votes
0answers
11 views

Mallet stops working for large data sets?

I am trying to use LDA Mallet to assign my tweets to topics, and it works perfectly well when I feed it with up to 500,000 tweets, but it seems to stop working when I use my whole data set, which is ...
-1
votes
0answers
24 views

Optimal stop words removal method supporting R 3.3.0 [on hold]

I am trying to remove stopwords in R version 3.3.0 or below. I tried using tm, but it is not able to install slam. Do spacyr etc. have a Python dependency? I tried using a custom function, but it is too slow.
1
vote
2answers
39 views

How to use trycatch to skip errors and move on to next in the list

I want to parse rtf files from a folder in the rtf files that resulted in errors during the lapply step. I am new to using trycatch, so how can I incorporate it in my code (the lapply step) to ignore ...
-1
votes
0answers
34 views

Stock price prediction with financial news in R?

I am working in R with stock prices (sp500) and text documents. I downloaded financial news which already passed the preprocessing process with the help of the tm package. I created a corpus, cleaned ...
4
votes
2answers
22 views

Count the occurrences of words in a string row wise based on existing words in other columns

I have a data frame that has rows of strings. I want to count the occurrence of words in the rows based on what words appear in the column. How can I achieve this with the code below? Can the below ...
1
vote
1answer
55 views

Using cosine similarity for classifying documents

I have a set of files for five different categories, and most of them are not labelled correctly. The objective is to predict the correct category of a file whenever it is uploaded. I used cosine ...
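One way to picture the approach in this question is a nearest-profile classifier over bag-of-words counts. The sketch below is only an illustration under stated assumptions: the function names `cosine_similarity` and `classify` and the whitespace tokenization are hypothetical and not taken from the asker's code, and it presumes at least a few correctly labelled examples per category.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, labelled_examples):
    """Return the label whose pooled word counts are most similar to `text`.

    `labelled_examples` is an iterable of (label, text) pairs (hypothetical format).
    """
    query = Counter(text.lower().split())
    profiles = {}
    for label, example in labelled_examples:
        # Pool all examples of a label into one term-count profile.
        profiles.setdefault(label, Counter()).update(example.lower().split())
    return max(profiles, key=lambda label: cosine_similarity(query, profiles[label]))
```

Raw counts are used here for brevity; weighting terms with tf-idf before computing the cosine is a common refinement.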
0
votes
1answer
20 views

Upload .txt as one cell for further editing

My question: how do I upload a .txt so that it appears as one cell? I need to prepare my data first and want to do that in R. So I want to upload the .txt file I have to R so that it is in the shape of a single ...
0
votes
1answer
30 views

Sentiment Analysis in R using TDM/DTM

I am trying to apply a sentiment analysis in R with the help of my DTM (document term matrix) or TDM (term document matrix). I could not find any similar topic in the forum and on google. Thus, I ...
-2
votes
0answers
23 views

Similarity between sentences using w2v

Firstly I would like to call out explicitly that I am new to the world of machine learning and data science. Use Case I currently have a use case where I am looking to find similarity between ...
-4
votes
0answers
34 views

Algorithm for very short text summary

We need to write a Python script that summarizes a long text. We already have a summarizer, but now we would like something that really just extracts a very short summary of the text. We have a good ...
-3
votes
0answers
13 views

Assign words into a variety of topics

I have around 2000 separate words, and I want to filter them and assign them to related topics (all related to daily activities that a person performs, like pizza, park, soccer, computer). My topics for ...
0
votes
1answer
51 views

Passing multiple arguments as a list in R

I wish to pass a list of arguments as a vector to another command in R. I do not want to repeat the same set of arguments every time. This is the code that I have to run 6 times for each $full_text ...
-3
votes
0answers
15 views

What library to use to detect check boxes and tick marks?

I have multiple PDF files from which I am creating a dataset. The PDFs contain questionnaires with tick marks and check boxes. Can you suggest a library I can use in Python to extract ...
0
votes
0answers
7 views

Download older version of a package that has been removed from CRAN

I resumed working on text mining after a long hiatus, but soon found out that the package RTextTools has been removed from CRAN and is no longer maintained. I tried to download it ...
0
votes
0answers
15 views

Using custom corpora for classification

I created a custom corpus using PlaintextCorpusReader of nltk and created a word2vec model with gensim for it. Now I want to use that in a tf-idf calculation so that I can apply it to text classification. ...
0
votes
0answers
32 views

How to summarize email text using LDA in R

I am working on complaints data analysis, where I am applying a text summarization technique to reduce unnecessary text and bring out only the useful text. I have used LDA - Latent Dirichlet Allocation in ...
0
votes
0answers
10 views

How can I select one text in a tCorpus in R?

For my master's thesis, I am analyzing the number of nationalist words in presidential speeches. Based on a dictionary "dict" (here it only consists of 3 word groups), I would like to create vectors in ...
-1
votes
0answers
18 views

Is there any tool to create a conclusion from lots of sentences?

Not sure if this question can be asked here, if it is not suitable, please leave a comment, I'll move it to suitable place. The problem is that I have lots of thesis subjects I may devote to, e.g. "...
-2
votes
0answers
50 views

Python - NLP for Reading Fare Rules Raw Text and putting in tables in database or pass to API

I have a unique problem that I am trying to solve, but I couldn't find a suitable solution. It is about reading the complex fare rules raw text for different airlines and putting the details in tabular format ...
1
vote
2answers
37 views

How to associate a date extracted from a pdf file with the data extracted from it using R?

What I have: I have two .pdf files that each contain a table with buy and sell stock information and a date in the top right corner of each page header. See the files here. If necessary save the two .pdf ...
0
votes
0answers
18 views

Topic label of each document in LDA model using textmineR

I'm using textmineR to fit an LDA model to documents similar to https://cran.r-project.org/web/packages/textmineR/vignettes/c_topic_modeling.html. Is it possible to get the topic label for each ...
2
votes
1answer
54 views

Subsetting a character vector column into multiple columns

I have the following tibble: colours = tribble( ~all, c('blue','green', 'red', 'pink', 'yellow', 'gold', 'orange', 'ivory', 'brown', 'beige'), c('green', 'red', 'pink', 'orange', 'ivory', '...
-1
votes
0answers
20 views

An efficient approach to querying Wikipedia in R

I am looking for recommendations for an efficient approach to querying Wikipedia in R and making a corpus of the results, i.e. taking search terms and returning page contents for later text analysis, e.g. a variable ...
0
votes
1answer
41 views

How to remove empty values after preprocessing text in Python

For example, I have a tweet "@cintya @groot @smanela https://blog..." and I run a preprocessing step that deletes links and mentions, so I expect the tweet to disappear. But in the CSV, it comes back as an empty ...
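A minimal sketch of the situation described: once mentions and links are stripped, a tweet that consisted only of those becomes an empty string, and it has to be filtered out explicitly before writing the CSV. The regex, the helper names `clean_tweet` and `clean_and_drop_empty`, and the example URL are assumptions for illustration, not the asker's code.

```python
import re

# Matches @mentions and http(s) links; a deliberately simple, assumed pattern.
MENTION_OR_URL = re.compile(r"@\w+|https?://\S+")


def clean_tweet(text: str) -> str:
    """Remove mentions and links, then collapse leftover whitespace."""
    return " ".join(MENTION_OR_URL.sub(" ", text).split())


def clean_and_drop_empty(tweets):
    """Clean every tweet and keep only those that still contain text."""
    cleaned = (clean_tweet(t) for t in tweets)
    return [t for t in cleaned if t]
```

If the data lives in a pandas dataframe instead of a list, the same idea applies: clean the column first, then keep only rows whose cleaned text is non-empty.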
0
votes
1answer
16 views

Finding tf-idf values in an announcement table

I want to do an analysis of an announcement. I have to calculate 'tf' and 'idf' values, but I think the values are not realistic. Is there a problem with the code? "stemming" line is announcements. ...
-2
votes
0answers
16 views

How to find noisy (misclassified) records in each cluster after k-means clustering

We have 100 data points. After clustering the data set, these data points are grouped into 5 clusters. I want to find the noisy (misclassified) data points in each cluster.
0
votes
1answer
12 views

Link 2 annotations together in a window of up to 10 words using Ruta

Is there a way to link 2 annotations that are within a window of 10 words together into a new one? The following doesn't work: Entity W{1,10} Entity{->CREATE(Entity)}; Thanks and all the best ...
-2
votes
0answers
30 views

Text embedding on multiple separate columns

I have a dataset with multiple columns of categorical data, and I am considering applying tf-idf (string embedding). My main concern is whether it can be done for each column ...
2
votes
1answer
17 views

String Concatenation in Ruta

Does somebody know what is wrong with my String Concatenation in Ruta? FOREACH (d) IngredientConcept{} { d{->CREATE(Entity, "label"="Drug", "value"= d.conceptID + "_" + d.dictCanon)}; } ...
0
votes
2answers
38 views

Text mining a large list of Notes for Vehicle Identification Number (VIN#) with Python

I have a large data set of Insurance Claims data with 2 columns. One column is a claim identifier. The other is a large string of notes that go with the claim. My goal is to text mine the Claims ...
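As a starting point for this kind of extraction, a regular expression can pull out VIN-shaped tokens from the free-text notes. The sketch assumes modern 17-character VINs, which use digits and capital letters except I, O and Q; the function name `extract_vins` and the sample note are hypothetical. It only checks the shape of the token and does not validate the VIN check digit.

```python
import re

# 17 characters drawn from digits and letters, excluding I, O and Q.
VIN_PATTERN = re.compile(r"\b[A-HJ-NPR-Z0-9]{17}\b")


def extract_vins(note: str) -> list:
    """Return every VIN-shaped token found in a claim note."""
    # Upper-case first so VINs typed in lower case are still matched.
    return VIN_PATTERN.findall(note.upper())
```

Applied row by row to the notes column, this gives a list of candidate VINs per claim identifier; any 17-character alphanumeric token will match, so false positives are possible.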
0
votes
2answers
56 views

How to filter out lines within each row in R?

I have been working on this problem for many days but couldn't get the expected results. I have a dataframe containing conversations between two people, A & B, within each row (it is like 1 row ...
1
vote
1answer
32 views

Interactive learning [closed]

I'm new to NLP and text mining and I'm trying to build a document classifier. Once the model is trained, we test it on new documents (the test data has no labels). It is expected that the ...
0
votes
0answers
16 views

Create a co-occurrence dataframe and the distance among elements through cosine function in R

I have a dataframe structured like this: Variable S1; S2 S3; S1; S2 S4; S2 I want to obtain a new dataframe made by three columns: V1. V2. Dist S1. S2. 2 S2. S3. 1 S3. S4. ...
0
votes
1answer
17 views

Read multiple pdf files at once and extract sentences that contain a keyword using R

Let's assume that I have a few pdf files stored in a directory and I want to read all those pdf files at once and extract all the sentences that contain a specific keyword (in this case 'provisions') ...
2
votes
1answer
54 views

Extracting university names from affiliation in Pubmed data with R

I've been using the extremely useful rentrez package in R to get information about author, article ID and author affiliation from the Pubmed database. This works fine but now I would like to extract ...
0
votes
1answer
67 views

How can I generate a word cloud from tokenized words in Python?

I have code to import a txt file and get tokenized words using the NLTK library (just like it is done in https://www.datacamp.com/community/tutorials/text-analytics-beginners-nltk). I did almost ...
0
votes
1answer
27 views

How do I append reviews text and reviews rating to a list

I am writing a program which analyses online reviews and based on the ratings, stores the review into review_text and the corresponding rating into review_label as either positive(4 & 5 stars) or ...
1
vote
1answer
20 views

Mining a Dataframe for a Count of Unique Words

I'm looking to take a set of strings in a dataframe and then break those strings up in order to get a count of distinct words in the strings. The ultimate idea is this: Word 1: 5 times Word 2: 3 ...
0
votes
1answer
23 views

Read text files with paragraphs as one string using VCorpus from tm package in r

I have a list of text files in my directory, all of which are documents with multiple paragraphs. I want to read those documents and do sentiment analysis. For example, I have one text document data/...
0
votes
0answers
17 views

Pubmed Mine R: How to remove author names and information from the abstract?

I'm working on code that automatically extracts keywords from Pubmed abstracts. With the pubmed.mine.r package, it's possible to read and atomize the words of abstracts. But I really don't want to take ...
1
vote
0answers
30 views

Feature importance in XGBoost for text predictions

I have two text files of reviews, rated positive and negative. After preprocessing the data with NLP, I make predictions with XGBoost, and I am trying to get feature importance as a categorical variable ...
0
votes
0answers
35 views

classify polarity Query using sentimentr

I started a tutorial using the sentiment package & had to change to sentimentr as the aforementioned has been removed from the R repository. In sentimentr what function / library would I use to ...
0
votes
1answer
43 views

How to use NLTK countvectorizer in a dictionary in python?

I have used csv reader to read my tsv file, which contains three columns: lie, sentiment and review. I have created a dictionary to read my tsv file data as shown in the code below. Next, I would like to ...
0
votes
1answer
21 views

How to load a folder (with text files) from your computer on Jupyter to be able to run analyses on them together?

I am trying to load a folder (containing about 1000 .txt files) into my Jupyter notebook (Python 3) from the desktop of my Windows computer, so that I can proceed with my analyses relating to NLP. I am ...
0
votes
2answers
51 views

How to remove white spaces within a word using python?

This is the input given: John plays chess and l u d o. I want the output to be in this format (given below): John plays chess and ludo. I have tried a regular expression for removing spaces, but it doesn't ...
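One possible heuristic for the example in this question is to join only runs of single letters separated by single spaces, leaving normal words untouched. The function name `join_spaced_letters` and the threshold of three letters are assumptions chosen for this sketch; a run such as "I a m" would also be joined, so the threshold may need tuning for real data.

```python
import re


def join_spaced_letters(text: str, min_letters: int = 3) -> str:
    """Join runs of at least `min_letters` single letters split by single spaces."""
    # (letter + space) repeated, followed by a final single letter, on word boundaries.
    pattern = re.compile(r"\b(?:[A-Za-z] ){%d,}[A-Za-z]\b" % (min_letters - 1))
    return pattern.sub(lambda m: m.group(0).replace(" ", ""), text)


print(join_spaced_letters("John plays chess and l u d o."))
# John plays chess and ludo.
```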
0
votes
0answers
24 views

Can't get the results for review extraction using text mining in r

I'm trying to extract the reviews of a specific product from FLIPKART, but can't get the desired results # Importing reviews data library(rvest) library(XML) library(magrittr) # Poco F1 Reviews #####...
0
votes
2answers
34 views

Is there an R technique to group_by, search, and match a long data structure?

This is a problem of finding which ids have matching words, from a list of 5 words for each id. We have a long data structure from a text mining project with an id and the word. Each group_id has 5 ...
0
votes
2answers
34 views

How to find the most frequent words in a corpus in a Pandas dataframe (Python)

I have a Pandas dataframe that looks like the following. I have tokenized my text files and used NLTK Countvectorizer to convert them into a pandas dataframe. In addition, I have already removed stopwords and ...
0
votes
1answer
74 views

Interpret the Doc2Vec Vectors Clusters Representation

I am new to Doc2Vec, please bear with the naive questions. I have generated Doc2vector score i.e. using the 'Paragraph Vector' algorithm. I have an array output for each document. I use the model....
0
votes
0answers
65 views

How to calculate precision / recall score for keywords in sklearn / Python?

I have the following code that calculates precision/recall and F1 score for my model, which detects keywords in a document: from sklearn.metrics import classification_report y_true = ['apple', '...
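Since the code in this excerpt is cut off, the following is not a reconstruction of it. It is a separate sketch of one common way to score keyword extraction: treat the true and predicted keywords of a document as sets and compute precision, recall and F1 from their overlap. The function name `keyword_prf` is hypothetical.

```python
def keyword_prf(true_keywords, predicted_keywords):
    """Set-based precision, recall and F1 for one document's keywords."""
    true_set, pred_set = set(true_keywords), set(predicted_keywords)
    hits = len(true_set & pred_set)  # keywords that were predicted and are correct
    precision = hits / len(pred_set) if pred_set else 0.0
    recall = hits / len(true_set) if true_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

This differs from passing two keyword lists straight to a classification metric, which compares them position by position rather than as unordered sets.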