R Not Reading First Row of Csv
Reading and Writing CSV Files
Overview
Teaching: xxx min
Exercises: 0 minQuestions
How do I read data from a CSV file into R?
How practice I write information to a CSV file?
Objectives
Read in a .csv, and explore the arguments of the csv reader.
Write the altered data set up to a new .csv, and explore the arguments.
The virtually common mode that scientists store data is in Excel spreadsheets. While there are R packages designed to access data from Excel spreadsheets (e.one thousand., gdata, RODBC, XLConnect, xlsx, RExcel), users frequently notice information technology easier to save their spreadsheets in comma-separated values files (CSV) and then apply R'south built in functionality to read and manipulate the data. In this short lesson, we'll acquire how to read information from a .csv and write to a new .csv, and explore the arguments that permit you lot read and write the data correctly for your needs.
Read a .csv and Explore the Arguments
Allow'southward start by opening a .csv file containing data on the speeds at which cars of dissimilar colors were clocked in 45 mph zones in the four-corners states (CarSpeeds.csv
). Nosotros will use the congenital in read.csv(...)
function call, which reads the data in as a information frame, and assign the data frame to a variable (using <-
) and so that it is stored in R's retention. Then we will explore some of the basic arguments that can be supplied to the function. Starting time, open the RStudio project containing the scripts and information you lot were working on in episode 'Analyzing Patient Data'.
# Import the data and expect at the kickoff half dozen rows carSpeeds <- read.csv ( file = 'data/car-speeds.csv' ) head ( carSpeeds )
Color Speed Country i Bluish 32 NewMexico two Reddish 45 Arizona three Blue 35 Colorado four White 34 Arizona v Ruby-red 25 Arizona half dozen Bluish 41 Arizona
Changing Delimiters
The default delimiter of the
read.csv()
function is a comma, simply yous tin use other delimiters by supplying the 'sep' argument to the function (e.1000., typingsep = ';'
allows a semi-colon separated file to exist correctly imported - see?read.csv()
for more information on this and other options for working with different file types).
The call above volition import the information, just we have not taken advantage of several handy arguments that can be helpful in loading the data in the format nosotros want. Allow's explore some of these arguments.
The default for read.csv(...)
is to set the header
statement to TRUE
. This ways that the first row of values in the .csv is fix as header information (column names). If your data set does not take a header, set the header
argument to Simulated
:
# The first row of the data without setting the header argument: carSpeeds [ 1 , ]
Colour Speed State 1 Bluish 32 NewMexico
# The first row of the information if the header statement is set to Imitation: carSpeeds <- read.csv ( file = 'information/car-speeds.csv' , header = FALSE ) carSpeeds [ ane , ]
V1 V2 V3 ane Color Speed State
Conspicuously this is not the desired behavior for this data set, but it may exist useful if y'all have a dataset without headers.
The stringsAsFactors
Argument
In older versions of R (prior to 4.0) this was perchance the almost important argument in read.csv()
, especially if you were working with categorical data. This is because the default behavior of R was to convert grapheme strings into factors, which may make information technology hard to do such things equally supercede values. Information technology is important to be enlightened of this behaviour, which we will demonstrate. For example, permit'south say we observe out that the data collector was color blind, and accidentally recorded greenish cars as existence blueish. In social club to correct the data gear up, let'due south supervene upon 'Bluish' with 'Green' in the $Colour
column:
# Here we volition use R'south `ifelse` part, in which we provide the test phrase, # the effect if the outcome of the test is 'TRUE', and the outcome if the # result is 'Simulated'. We volition besides assign the results to the Color cavalcade, # using '<-' # Kickoff - reload the data with a header carSpeeds <- read.csv ( file = 'data/car-speeds.csv' , stringsAsFactors = TRUE ) carSpeeds $ Colour <- ifelse ( carSpeeds $ Color == 'Bluish' , 'Dark-green' , carSpeeds $ Color ) carSpeeds $ Color
[1] "Green" "1" "Green" "5" "4" "Green" "Green" "2" "5" [10] "iv" "4" "5" "Green" "Light-green" "2" "4" "Light-green" "Light-green" [19] "5" "Dark-green" "Green" "Green" "4" "Light-green" "4" "4" "4" [28] "4" "v" "Green" "4" "5" "2" "4" "2" "ii" [37] "Dark-green" "4" "2" "4" "ii" "two" "4" "four" "5" [46] "2" "Green" "4" "4" "2" "2" "4" "5" "4" [55] "Greenish" "Dark-green" "2" "Light-green" "v" "2" "4" "Green" "Green" [64] "five" "2" "4" "four" "ii" "Green" "5" "Light-green" "4" [73] "5" "5" "Green" "Greenish" "Light-green" "Green" "Green" "5" "2" [82] "Light-green" "5" "2" "two" "4" "4" "5" "v" "5" [91] "five" "four" "4" "4" "five" "ii" "v" "2" "2" [100] "5"
What happened?!? Information technology looks like 'Blue' was replaced with 'Green', but every other color was turned into a number (as a grapheme cord, given the quote marks before and afterwards). This is because the colors of the cars were loaded as factors, and the factor level was reported post-obit replacement.
To come across the internal construction, we can apply another role, str()
. In this case, the dataframe's internal structure includes the format of each column, which is what we are interested in. str()
will exist reviewed a little more in the lesson Information Types and Structures.
# Reload the data with a header (the previous ifelse call modifies attributes) carSpeeds <- read.csv ( file = 'data/car-speeds.csv' , stringsAsFactors = TRUE ) str ( carSpeeds )
'data.frame': 100 obs. of 3 variables: $ Color: Factor w/ 5 levels " Red","Blackness",..: 3 1 iii five iv 3 3 2 5 four ... $ Speed: int 32 45 35 34 25 41 34 29 31 26 ... $ State: Factor w/ iv levels "Arizona","Colorado",..: 3 i 2 i 1 1 3 2 ane two ...
We can see that the $Color
and $Land
columns are factors and $Speed
is a numeric cavalcade.
Now, let's load the dataset using stringsAsFactors=FALSE
, and see what happens when nosotros try to replace 'Blue' with 'Green' in the $Color
column:
carSpeeds <- read.csv ( file = 'information/car-speeds.csv' , stringsAsFactors = Fake ) str ( carSpeeds )
'data.frame': 100 obs. of 3 variables: $ Colour: chr "Blue" " Ruby-red" "Blue" "White" ... $ Speed: int 32 45 35 34 25 41 34 29 31 26 ... $ Country: chr "NewMexico" "Arizona" "Colorado" "Arizona" ...
carSpeeds $ Color <- ifelse ( carSpeeds $ Color == 'Bluish' , 'Light-green' , carSpeeds $ Color ) carSpeeds $ Color
[i] "Light-green" " Red" "Green" "White" "Cherry" "Green" "Green" "Blackness" "White" [ten] "Red" "Ruby" "White" "Greenish" "Green" "Black" "Red" "Green" "Greenish" [19] "White" "Greenish" "Green" "Light-green" "Scarlet" "Green" "Red" "Blood-red" "Crimson" [28] "Red" "White" "Greenish" "Red" "White" "Black" "Cherry" "Black" "Blackness" [37] "Green" "Ruby" "Black" "Red" "Black" "Black" "Crimson" "Red" "White" [46] "Black" "Dark-green" "Red" "Red" "Black" "Black" "Ruby-red" "White" "Blood-red" [55] "Dark-green" "Green" "Black" "Green" "White" "Black" "Reddish" "Green" "Green" [64] "White" "Blackness" "Red" "Red" "Black" "Light-green" "White" "Green" "Carmine" [73] "White" "White" "Green" "Green" "Light-green" "Green" "Dark-green" "White" "Black" [82] "Dark-green" "White" "Black" "Black" "Carmine" "Red" "White" "White" "White" [91] "White" "Blood-red" "Cherry" "Red" "White" "Black" "White" "Black" "Blackness" [100] "White"
That's better! And we tin come across how the data now is read as graphic symbol instead of factor. From R version 4.0 onwards we do not have to specify stringsAsFactors=FALSE
, this is the default behavior.
The as.is
Statement
This is an extension of the stringsAsFactors
statement, just gives you control over private columns. For example, if we want the colors of cars imported every bit strings, but nosotros want the names of the states imported as factors, nosotros would load the information set as:
carSpeeds <- read.csv ( file = 'data/car-speeds.csv' , as.is = 1 ) # Note, the ane applies as.is to the first column only
At present we can see that if we attempt to replace 'Bluish' with 'Dark-green' in the $Color
column everything looks fine, while trying to replace 'Arizona' with 'Ohio' in the $Country
column returns the cistron numbers for the names of states that we haven't replaced:
'data.frame': 100 obs. of iii variables: $ Color: chr "Blue" " Red" "Blue" "White" ... $ Speed: int 32 45 35 34 25 41 34 29 31 26 ... $ Country: Factor w/ 4 levels "Arizona","Colorado",..: 3 one 2 i i 1 3 2 1 2 ...
carSpeeds $ Colour <- ifelse ( carSpeeds $ Color == 'Blue' , 'Green' , carSpeeds $ Colour ) carSpeeds $ Color
[1] "Greenish" " Crimson" "Green" "White" "Red" "Light-green" "Green" "Black" "White" [x] "Cherry-red" "Cerise" "White" "Light-green" "Green" "Black" "Red" "Green" "Light-green" [xix] "White" "Dark-green" "Dark-green" "Dark-green" "Red" "Greenish" "Ruddy" "Scarlet" "Red" [28] "Ruby-red" "White" "Dark-green" "Red" "White" "Black" "Red" "Blackness" "Black" [37] "Green" "Red" "Black" "Ruby-red" "Blackness" "Black" "Red" "Reddish" "White" [46] "Black" "Green" "Cherry" "Cherry-red" "Black" "Black" "Cherry-red" "White" "Red" [55] "Green" "Green" "Black" "Green" "White" "Blackness" "Red" "Green" "Green" [64] "White" "Black" "Red" "Red" "Blackness" "Green" "White" "Green" "Red" [73] "White" "White" "Green" "Dark-green" "Greenish" "Dark-green" "Dark-green" "White" "Blackness" [82] "Greenish" "White" "Black" "Black" "Red" "Scarlet" "White" "White" "White" [91] "White" "Red" "Red" "Carmine" "White" "Black" "White" "Black" "Blackness" [100] "White"
carSpeeds $ State <- ifelse ( carSpeeds $ Land == 'Arizona' , 'Ohio' , carSpeeds $ State ) carSpeeds $ Country
[i] "3" "Ohio" "2" "Ohio" "Ohio" "Ohio" "three" "2" "Ohio" "two" [11] "4" "4" "4" "four" "4" "iii" "Ohio" "iii" "Ohio" "4" [21] "4" "4" "three" "ii" "2" "3" "2" "four" "2" "four" [31] "three" "2" "ii" "iv" "2" "2" "3" "Ohio" "4" "2" [41] "two" "3" "Ohio" "4" "Ohio" "ii" "3" "3" "3" "two" [51] "Ohio" "iv" "4" "Ohio" "iii" "2" "4" "two" "4" "4" [61] "4" "2" "three" "2" "3" "two" "three" "Ohio" "three" "4" [71] "iv" "2" "Ohio" "4" "2" "two" "ii" "Ohio" "iii" "Ohio" [81] "4" "2" "ii" "Ohio" "Ohio" "Ohio" "four" "Ohio" "4" "4" [91] "4" "Ohio" "Ohio" "3" "two" "2" "four" "iii" "Ohio" "4"
We can run into that $Color
cavalcade is a character while $State
is a factor.
Updating Values in a Factor
Suppose we desire to go on the colors of cars as factors for another operations we want to perform. Write code for replacing 'Blueish' with 'Green' in the
$Color
column of the cars dataset without importing the data withstringsAsFactors=FALSE
.Solution
carSpeeds <- read.csv ( file = 'data/car-speeds.csv' ) # Replace 'Bluish' with 'Green' in cars$Color without using the stringsAsFactors # or as.is arguments carSpeeds $ Color <- ifelse ( as.character ( carSpeeds $ Colour ) == 'Blue' , 'Green' , as.graphic symbol ( carSpeeds $ Color )) # Convert colors back to factors carSpeeds $ Color <- as.factor ( carSpeeds $ Color )
The strip.white
Argument
Information technology is not uncommon for mistakes to have been fabricated when the data were recorded, for example a space (whitespace) may accept been inserted before a information value. Past default this whitespace will be kept in the R environs, such that '\ Red' volition be recognized equally a different value than 'Red'. In lodge to avoid this blazon of error, use the strip.white
argument. Allow's meet how this works by checking for the unique values in the $Color
cavalcade of our dataset:
Here, the data recorder added a space before the color of the car in ane of the cells:
# Nosotros utilize the built-in unique() function to extract the unique colors in our dataset unique ( carSpeeds $ Colour )
[1] Dark-green Reddish White Carmine Black Levels: Red Blackness Light-green Crimson White
Oops, we see ii values for ruby-red cars.
Let'southward effort once again, this time importing the data using the strip.white
statement. NOTE - this argument must be accompanied by the sep
statement, by which we point the type of delimiter in the file (the comma for most .csv files)
carSpeeds <- read.csv ( file = 'data/machine-speeds.csv' , stringsAsFactors = FALSE , strip.white = TRUE , sep = ',' ) unique ( carSpeeds $ Color )
[1] "Blueish" "Red" "White" "Black"
That's meliorate!
Specify Missing Data When Loading
It is common for data sets to take missing values, or mistakes. The convention for recording missing values ofttimes depends on the private who collected the data and can be recorded as
n.a.
,--
, or empty cells " ". R recognises the reserved character stringNA
as a missing value, just not some of the examples above. Let'southward say the inflamation scale in the data set we used earlierinflammation-01.csv
actually starts at1
for no inflamation and the zero values (0
) were a missed ascertainment. Looking at the?read.csv
help page is in that location an statement nosotros could employ to ensure all zeros (0
) are read in asNA
? Possibly, in theauto-speeds.csv
data contains mistakes and the person measuring the motorcar speeds could not accurately distinguish betwixt "Black or "Blue" cars. Is in that location a way to specify more than one 'string', such as "Blackness" and "Blue", to be replaced byNA
Solution
read.csv ( file = "data/inflammation-01.csv" , na.strings = "0" )
or , in
car-speeds.csv
utilize a character vector for multiple values.read.csv ( file = 'data/motorcar-speeds.csv' , na.strings = c ( "Black" , "Blue" ) )
Write a New .csv and Explore the Arguments
Subsequently altering our cars dataset past replacing 'Blue' with 'Green' in the $Color
column, we now desire to save the output. There are several arguments for the write.csv(...)
function call, a few of which are particularly important for how the information are exported. Let'south explore these now.
# Consign the data. The write.csv() function requires a minimum of two # arguments, the data to be saved and the name of the output file. write.csv ( carSpeeds , file = 'data/car-speeds-cleaned.csv' )
If you open the file, yous'll see that it has header names, because the data had headers within R, simply that there are numbers in the first column.
The row.names
Statement
This argument allows united states of america to set the names of the rows in the output information file. R's default for this argument is TRUE
, and since it does not know what else to name the rows for the cars data set up, it resorts to using row numbers. To right this, we can fix row.names
to FALSE
:
write.csv ( carSpeeds , file = 'information/car-speeds-cleaned.csv' , row.names = FALSE )
At present we come across:
Setting Column Names
There is also a
col.names
argument, which can exist used to ready the column names for a data set without headers. If the information gear up already has headers (east.g., we used theheaders = True
statement when importing the data) and so acol.names
argument will be ignored.
The na
Argument
There are times when we want to specify sure values for NA
s in the information gear up (east.thousand., we are going to laissez passer the data to a program that only accepts -9999 equally a nodata value). In this case, we desire to set up the NA
value of our output file to the desired value, using the na argument. Let's see how this works:
# First, replace the speed in the 3rd row with NA, by using an alphabetize (square # brackets to indicate the position of the value we want to replace) carSpeeds $ Speed [ 3 ] <- NA head ( carSpeeds )
Color Speed Land 1 Blue 32 NewMexico ii Red 45 Arizona three Blue NA Colorado iv White 34 Arizona 5 Scarlet 25 Arizona half-dozen Blue 41 Arizona
write.csv ( carSpeeds , file = 'data/car-speeds-cleaned.csv' , row.names = FALSE )
At present we'll set NA
to -9999 when we write the new .csv file:
# Notation - the na argument requires a cord input write.csv ( carSpeeds , file = 'data/auto-speeds-cleaned.csv' , row.names = Fake , na = '-9999' )
And we see:
Cardinal Points
Import data from a .csv file using the
read.csv(...)
function.Understand some of the cardinal arguments available for importing the data properly, including
header
,stringsAsFactors
,as.is
, andstrip.white
.Write information to a new .csv file using the
write.csv(...)
functionUnderstand some of the key arguments available for exporting the data properly, such as
row.names
,col.names
, andna
.
Source: https://swcarpentry.github.io/r-novice-inflammation/11-supp-read-write-csv/
Post a Comment for "R Not Reading First Row of Csv"