Newest 'data-cleaning' Questions - Stack Overflow

Questions tagged [data-cleaning]

Data cleaning is the process of removing or repairing errors in, and normalizing, data used in computer programs. For example, outliers may be removed, missing samples may be interpolated, invalid values may be marked as unavailable, and synonymous values may be merged. One approach to data cleaning is the "tidy data" framework from Wickham, http://vita.had.co.nz/papers/tidy-data.pdf, in which each row is an observation and each column is a variable.

-1
votes
0answers
14 views

Best practice for selecting fair test set given bad data

We develop machine learning models, mostly using scikit-learn. Like most real-world data, our training datasets have a large number of (somewhat) identifiable "bad rows." In this case, BAD is ...
0
votes
0answers
14 views

What is the better way to use tokenization to get more accuracy?

What extra features can I add for cleaning the text in this code to improve accuracy? Is it better to use nltk.sent_tokenize and then nltk.word_tokenize, or can we directly use nltk.word_tokenize and ...
0
votes
1answer
23 views

Match rows by id value in python pandas

I need to match contacts in a database by how they were contacted by a unique ID number. I've created a very small mock dataframe below to help with a suggestion: data = [['email', 'emailperson1@...
1
vote
4answers
64 views

How to write R code to delete duplicate rows where one observation is the negative value of the duplicate?

I have some sales data where mistakes recorded at the point of sale are corrected afterward, but the data set still contains the record for the initial mistake and then a duplicate of the mistake but with a ...
0
votes
1answer
18 views

PySpark transform dataframe

Let's say I have the following data in a dataframe receipts:

Id | Fruits
1 | ['apple', 'banana']
2 | ['apple']
3 | ['pear']
4 | ['pear', 'banana']

And I want to ...
-1
votes
0answers
25 views

Converting first few rows into columns

I have a dataset where the data is stored in rows. There is a title row followed by 3 rows; these 3 rows are similar across the dataset and I want to convert them into columns. I want to preserve the titles in ...
0
votes
2answers
45 views

Data Cleaning Python: Replacing the values of a column not within a range with NaN and then dropping the rows which contain NaN

I am doing some research and need to delete the rows containing values which are not in a specific range using Python. My Dataset in Excel: I want to replace the big values of column A (not ...
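The two steps named in the title (mask out-of-range values as NaN, then drop rows holding NaN) can be sketched roughly as follows. This is a plain-Python illustration, not the asker's code: the column name `A` comes from the excerpt, while the bounds `LOW` and `HIGH`, the column `B`, and the sample rows are hypothetical, since the excerpt does not state the actual range or data.

```python
import math

# Hypothetical bounds; the excerpt does not state the actual range.
LOW, HIGH = 0.0, 100.0

def mask_out_of_range(rows, column, low=LOW, high=HIGH):
    """Return copies of rows with out-of-range values in `column` set to NaN."""
    masked = []
    for row in rows:
        value = row[column]
        in_range = low <= value <= high
        masked.append({**row, column: value if in_range else math.nan})
    return masked

def drop_nan_rows(rows):
    """Keep only rows in which no float value is NaN."""
    return [
        row for row in rows
        if not any(isinstance(v, float) and math.isnan(v) for v in row.values())
    ]

# Illustrative rows only; column B is made up for the example.
rows = [{"A": 12.0, "B": 1.0}, {"A": 5000.0, "B": 2.0}, {"A": 47.5, "B": 3.0}]
cleaned = drop_nan_rows(mask_out_of_range(rows, "A"))
```

With pandas, the same idea is commonly written as `df["A"].where(df["A"].between(low, high))` followed by `df.dropna()`.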
0
votes
2answers
59 views

Handle missing values: when 99% of the data is missing from most columns (important ones)

I am facing a dilemma with a project of mine. A few of the variables don't have enough data, meaning almost 99% of the observations are missing. I am thinking of a couple of options - Impute missing ...
0
votes
2answers
19 views

How do I match similar names to a given row if they appear in one year and not the next and appear again?

Actual Question (couldn't add to title because it's too long): I have facility names in a list of lists, where each list is for a corresponding year. I want to create a data frame, with each row ...
-2
votes
1answer
28 views

How to modify (correct) values that are poorly written in a DataFrame with python

I have a csv file that contains values that are badly written. I want to correct these mistakes, for example replace Toyouta by toyota and maxda by mazda in the column named carCompany. The job I ...
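One possible way to correct such misspellings is approximate string matching against a list of known-good values. The sketch below uses the standard library's `difflib.get_close_matches`; the list `KNOWN_BRANDS`, the `cutoff` value, and the helper name `correct_brand` are assumptions made for illustration, and only `Toyouta`, `maxda`, and the column name `carCompany` come from the excerpt.

```python
import difflib

# Assumed list of correct spellings; a real one would come from domain knowledge.
KNOWN_BRANDS = ["toyota", "mazda", "nissan"]

def correct_brand(raw, known=KNOWN_BRANDS, cutoff=0.75):
    """Map a possibly misspelled name to its closest known spelling."""
    candidate = raw.strip().lower()
    matches = difflib.get_close_matches(candidate, known, n=1, cutoff=cutoff)
    # Leave the value unchanged when nothing is similar enough.
    return matches[0] if matches else raw

# Stand-in for the values of the carCompany column.
car_company = ["Toyouta", "maxda", "mazda", "unknownbrand"]
corrected = [correct_brand(name) for name in car_company]
```

A lower `cutoff` catches more typos but raises the risk of merging genuinely different values, so it is worth reviewing the resulting mapping before overwriting the column.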
-4
votes
2answers
35 views

How to count the number of columns containing a particular string in r? [closed]

I have some data that has names and tags associated with those names. There are up to 94 tags for each name. Each tag is in a separate column. I need to count the number of columns that contain a ...
-1
votes
2answers
36 views

How to fix a regular expression form for scraped url data via python?

I am trying to clean my url data using regular expression. I have already cleaned it bypass, but I have a last problem that I don't know how to solve. It is data that I have scraped from some ...
-1
votes
1answer
23 views

How would one write a regex command to convert string to datetime format in python?

How would one write the regex command to convert "yyyy-mm-ddXhh:mmY[GMT]" to a valid datetime format in python? I am new to regex and tried the following, but it didn't work: df[df.columns['a']] = df[...
0
votes
1answer
32 views

How can I access a CSV column surrounded by whitespaces?

I have a .csv file from which I have imported my data. I used a Pandas DataFrame. timestamp ,ty,la,lo,he,acc,v,be,x,y,z 1434838676097.07,gps,48.77,-81.3838208,220.8674103,6,41.72777754,134.6484375, I ...
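The header in the excerpt reads `timestamp ,ty,...`, so the first column name carries a trailing space. A common remedy is to strip whitespace from the header before looking columns up. The sketch below does this with the standard library's `csv` module rather than pandas; the in-memory string `raw` is a stand-in for the file and keeps only the first three fields shown in the excerpt.

```python
import csv
import io

# Tiny stand-in for the file: the first three header fields and values
# from the excerpt, nothing more.
raw = "timestamp ,ty,la\n1434838676097.07,gps,48.77\n"

reader = csv.DictReader(io.StringIO(raw))
# Strip stray whitespace from the header so "timestamp " becomes "timestamp".
reader.fieldnames = [name.strip() for name in reader.fieldnames]
rows = list(reader)
timestamps = [row["timestamp"] for row in rows]
```

In pandas, the equivalent cleanup is usually `df.columns = df.columns.str.strip()` right after reading the file.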
0
votes
1answer
21 views

What is the most efficient process to reconstruct a dyadic dataframe from an adjacency matrix?

I apologize for what I imagine is a fairly simple question. Unfortunately while my searches on here have returned a number of results for making adjacency matrices from dyadic dataframes, I haven't ...
1
vote
0answers
19 views

Is there an R function for checking if GeoJSON objects (polygon or multi-polygon) contain points, and which GeoJSON object contains which point?

I have an array of points { "Sheet1": [ { "CoM ID": "1040614", "Genus": "Washingtonia", "Year Planted": "1998", "Latitude": "-37.81387927", "Longitude": "144.9817733"...
0
votes
0answers
23 views

How to find original release date of a track on spotify (not re-release date)?

I am working on a personal project with Spotify where I essentially find the most similar song to an inputted song. Basically, I have created a huge dataset of tracks on Spotify (about 550,000 tracks),...
2
votes
1answer
31 views

Is there an R function for checking if a specified GeoJSON object (polygon or multi-polygon) contains the specified point?

I have an array of points { "Sheet1": [ { "CoM ID": "1040614", "Genus": "Washingtonia", "Year Planted": "1998", "Latitude": "-37.81387927", "Longitude": "144....
0
votes
0answers
5 views

Raw data to ecg waveform

I was looking to extract ECG data from an ECG machine, which I was able to do successfully using Bluetooth, which displays an array of some values calculated within certain ms. So, after receiving the ...
-1
votes
2answers
34 views

How to handle columns like 'country' and 'age groups' while making a prediction model in python?

I am very new to machine learning, and while I was working on this specific data-frame, I found it difficult to handle important columns like age groups and country. Here is a link to the data-set I ...
0
votes
0answers
21 views

Is there a way using SQL to turn an element of a composite key into a column? [duplicate]

I have a database I am pulling from that uses a composite key and assigns a rating to each row. For the purposes of analysis I need to restructure the data using SQL. I've been able to complete the ...
0
votes
2answers
21 views

return index of all factor variables that don't have a predefined name

I'm trying to write a function that will return the index of all binary variables in a data frame, with the exception of a predefined variable or list of variables supplied. You can generate example ...
2
votes
2answers
34 views

Regex to include only first encounter of “-” and “.”

I have the following regex \.(?![^.]+$)|[^-0-9.] which cleans out all characters from a number and keeps only the first '.' (hence matches last) as it can be a float. However, some numbers can also be ...
0
votes
1answer
26 views

How do I make a generalized function, which could be used for any column in the dataset?

def func1(dframe, Country, column_list, Role):
    dframe1 = dframe[dframe.Country == Country]
    dframe1 = dframe1[column_list]
    dframe1 = dframe1[dframe1.age != ...
0
votes
2answers
41 views

Loop through rows and assign value based on condition

I have dates for each row in my dataframe and want to assign a value to a new column based on a condition of the date. Normally if I assign a value to a new column, I would do something like this: ...
0
votes
0answers
19 views

Binary classification using scikit learn

For a binary labeled classification, is it necessary that every feature be represented by a binary value? For example I have a data set that uses age as a feature, is it necessary to divide the ...
-1
votes
1answer
27 views

Getting column names from dataset

I'm trying to get the column names from a dirty dataset. The column names start before the symbol "=". Is there a quick method to do this without looping over all the data? How it looks ...
0
votes
0answers
15 views

Unit of the Axis When Transforming Data; Sklearn and preprocessing

I've been looking at the different preprocessing methods available in sklearn for cleaning data. I want to make sure I'm looking at the results correctly, because I'm having issues conceptualizing ...
-1
votes
1answer
41 views

Removing all invalid characters (e.g. \uf0b7) from text

I currently have several texts coming in which sometimes contain an 'invalid character', e.g. \uf0b7 or \uf077. I don't have a way of knowing which of the invalid character codes a specific ...
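Both code points mentioned in the excerpt, U+F0B7 and U+F077, lie in the Unicode Private Use Area. Assuming the unwanted characters are all private-use code points (the excerpt does not confirm this), one way to remove them without listing each code is to filter on the Unicode general category `Co`. The helper name `strip_private_use` and the sample text are illustrative.

```python
import unicodedata

def strip_private_use(text):
    """Drop characters whose Unicode general category is 'Co' (private use)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Co")

# Illustrative input containing the two code points from the excerpt.
sample = "\uf0b7 first item \uf077 second item"
cleaned = strip_private_use(sample)
```

Filtering by category leaves ordinary non-ASCII text (accented letters, CJK, and so on) untouched, which a blanket "ASCII only" filter would not.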
0
votes
2answers
32 views

slicing pandas df based on repeated cycle values in a particular column

I have df like below (Example) index y z 0 118 . 1 118 . 2 118 . 3 116 4 116 5 110 6 110 7 104 ...
0
votes
1answer
24 views

pyzotero update fields in bulk

I am relatively new to Python, with slightly more than a year of programming experience with R. I am trying to write code that helps me update specific fields in my Zotero library so that they ...
1
vote
2answers
37 views

Getting Data in a single row into multiple rows

I have a code where I see which people work in certain groups. When I ask the leader of each group to present those who work for them, in a survey, I get a row of all of the team members. What I need ...
2
votes
0answers
39 views

Map original data from the dataset to new data using Datavec library and store it in Spark RDDs

I have a dataset that contains a latitude and longitude written like 20.55E and 30.11N. I want to replace these direction strings with an appropriate - where required. So basically, I'll map based on ...
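The mapping itself (independent of DataVec or Spark, which the question is actually about) can be sketched in a few lines. This Python helper is only an illustration of the logic: the function name `signed_coordinate` is hypothetical, and treating S and W as negative is the usual convention rather than something stated in the excerpt.

```python
def signed_coordinate(text):
    """Convert a string such as '20.55E' or '30.11N' to a signed float.

    Assumes the last character is one of N, S, E, W and that S and W
    map to negative values (the usual convention, not stated in the excerpt).
    """
    value, direction = text[:-1], text[-1].upper()
    if direction not in ("N", "S", "E", "W"):
        raise ValueError(f"unexpected direction in {text!r}")
    sign = -1.0 if direction in ("S", "W") else 1.0
    return sign * float(value)
```

For example, `signed_coordinate("20.55E")` keeps the value positive, while a western longitude such as `"20.55W"` would come back negative.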
-1
votes
1answer
41 views

Remove the version number in the data frame in R column [duplicate]

This is my original data.frame:

cell       counts  gene
TGCTACC-1  10      ALKBH5
TACACGA-1  20      KDM5C
TCCTTGG-1  30      EZH2
TACGGTC-1  30      PRMT2
...
1
vote
0answers
16 views

How to extract unique rows from a python pandas groupby object and save it in another dataframe? [duplicate]

I have a dataset of black Friday sales. The columns are User_ID, Product_ID, Gender, Occupation, Product_Category, Purchase, Marital_Status, etc. After analyzing the data, I found that the attribute ...
3
votes
1answer
50 views

How to create a matrix of density plots in R

Instead of creating separate plots for a data frame, I want to create a matrix of density plots for a data frame where I can see all the columns in one plot. For creating it separately I am ...
0
votes
0answers
30 views

Searching Elements of one column in an excel sheet with another column in another sheet

I have a column 'Places1' in one excel sheet(say search.xls) and there is a separate pooled excel sheet with columns 'Places' and 'Geo-Details' (say master.xls). I want to search if items in column '...
-1
votes
0answers
20 views

Issue while Parsing Dates with text in them (Kaggle Data Cleaning Challenge)

I am trying to solve the Data Cleaning Challenge Day 3 extra challenge problem Question Link but I am facing some issues. What I am doing is storing the year(BCE/CE) as a separate column and the year ...
1
vote
4answers
58 views

How to separate string from numbers in R?

I have a wild and crazy text file, the head of which looks like this:

2016-07-01 02:50:35 <name redacted> hey
2016-07-01 02:51:26 <name redacted> waiting for plane to Edinburgh
2016-07-01 ...
0
votes
0answers
19 views

Remove rows from csv that contain a specific word (pandas) [duplicate]

Need to remove all rows that contain the word 'thread' for example a row in the file reads 'Post-Match Thread: Liverpool 4-0 Barcelona [4-3 on agg.]' I have tried using the code below as mentioned in ...
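Note that the example row spells the word as `Thread`, so a case-sensitive search for `thread` would miss it. A minimal sketch of a case-insensitive filter, in plain Python rather than pandas, is shown below; the helper name `drop_rows_containing` and the second sample row are illustrative.

```python
def drop_rows_containing(rows, word):
    """Return the rows that do not contain `word`, ignoring case."""
    needle = word.lower()
    return [row for row in rows if needle not in row.lower()]

titles = [
    "Post-Match Thread: Liverpool 4-0 Barcelona [4-3 on agg.]",
    "Some other post title",  # illustrative row, not from the question
]
kept = drop_rows_containing(titles, "thread")
```

In pandas, the same filter is typically expressed as `df[~df[col].str.contains("thread", case=False, na=False)]`, where `col` stands for whichever column holds the text.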
2
votes
3answers
53 views

How to remove special characters from csv using pandas

Currently cleaning data from a csv file. Successfully made everything lowercase, removed stopwords and punctuation etc. But need to remove special characters. For example, the csv file contains things ...
1
vote
1answer
58 views

How to remove Stopwords from CSV file using NLTK?

Trying to remove stopwords from a csv file that has 3 columns and create a new csv file with the stopwords removed. This is successful; however, the data in the new file appears across the top row ...
0
votes
1answer
43 views

Is there a way to remove punctuation from Persian text?

I want to get rid of punctuation in my text file, which contains English-Persian sentence pairs. I have tried the following code: import string import re from numpy import array, argmax, random, ...
1
vote
0answers
22 views

How to split an age column which is currently in ranges into different age categories in Python

Currently I am using the suicide rates overview.csv data from Kaggle. I have an age column in which ages are given as ranges. I want to categorize them into specific age categories so that I could use that for ...
1
vote
3answers
63 views

Cannot convert object as strings to int - Unable to parse string

I have a data frame with one column denoting a range of ages. The data type of the Age column is shown as string. I am trying to convert string values to numeric for the model to interpret the features. ...
1
vote
2answers
48 views

Test column for special characters or only characters / numbers

I tried finding special characters using generic regex attributes and NOT LIKE clause but have been getting confusing results. The research suggested that it does not work the way it works in SQL ...
0
votes
1answer
51 views

Data cleaning and preparation for Time-Series-LSTM

I need to prepare my Data to feed it into an LSTM for predicting the next day. My Dataset is a time series in seconds but I have just 3-5 hours a day of Data. (I just have this specific Dataset so ...
-1
votes
1answer
29 views

How to use gather on dependent columns [duplicate]

I am trying to use gather function on the data but it's not giving the required output. Sample raw data - id listen_A listen_B speak_A speak_B 1 11 21 41 51 2 12 ...
0
votes
0answers
13 views

flagging strings that appear in one vector but not another (R) [duplicate]

have is a 7,000-obs data frame with character vars A and B. There are 400 unique values for A in total, and 6,500 unique values for B. obs A B 1 TJ.D KING.B 2 GRETCHEN.W TJ.D ...
1
vote
2answers
49 views

How can I remove the protected void finalize() method from my code

This problem concerns the Java method protected void finalize(). I tried to look at previous questions about this but still cannot figure out how to solve it. So one of my project classes is calling this ...