Skip to content Skip to sidebar Skip to footer

R Not Reading First Row of Csv

Reading and Writing CSV Files

Overview

Teaching: xxx min
Exercises: 0 min

Questions

  • How do I read data from a CSV file into R?

  • How practice I write information to a CSV file?

Objectives

  • Read in a .csv, and explore the arguments of the csv reader.

  • Write the altered data set up to a new .csv, and explore the arguments.

The virtually common mode that scientists store data is in Excel spreadsheets. While there are R packages designed to access data from Excel spreadsheets (e.one thousand., gdata, RODBC, XLConnect, xlsx, RExcel), users frequently notice information technology easier to save their spreadsheets in comma-separated values files (CSV) and then apply R'south built in functionality to read and manipulate the data. In this short lesson, we'll acquire how to read information from a .csv and write to a new .csv, and explore the arguments that permit you lot read and write the data correctly for your needs.

Read a .csv and Explore the Arguments

Allow'southward start by opening a .csv file containing data on the speeds at which cars of dissimilar colors were clocked in 45 mph zones in the four-corners states (CarSpeeds.csv). Nosotros will use the congenital in read.csv(...) function call, which reads the data in as a information frame, and assign the data frame to a variable (using <-) and so that it is stored in R's retention. Then we will explore some of the basic arguments that can be supplied to the function. Starting time, open the RStudio project containing the scripts and information you lot were working on in episode 'Analyzing Patient Data'.

                          # Import the data and expect at the kickoff half dozen rows                                          carSpeeds                                          <-                                          read.csv              (              file                                          =                                          'data/car-speeds.csv'              )                                          head              (              carSpeeds              )                                                  
                          Color Speed     Country i  Bluish    32 NewMexico two   Reddish    45   Arizona three  Blue    35  Colorado four White    34   Arizona v   Ruby-red    25   Arizona half dozen  Bluish    41   Arizona                      

Changing Delimiters

The default delimiter of the read.csv() function is a comma, simply yous tin use other delimiters by supplying the 'sep' argument to the function (e.1000., typing sep = ';' allows a semi-colon separated file to exist correctly imported - see ?read.csv() for more information on this and other options for working with different file types).

The call above volition import the information, just we have not taken advantage of several handy arguments that can be helpful in loading the data in the format nosotros want. Allow's explore some of these arguments.

The default for read.csv(...) is to set the header statement to TRUE. This ways that the first row of values in the .csv is fix as header information (column names). If your data set does not take a header, set the header argument to Simulated:

                          # The first row of the data without setting the header argument:                                          carSpeeds              [              1              ,                                          ]                                                  
                          Colour Speed     State 1  Bluish    32 NewMexico                      
                          # The first row of the information if the header statement is set to Imitation:                                          carSpeeds                                          <-                                          read.csv              (              file                                          =                                          'information/car-speeds.csv'              ,                                          header                                          =                                          FALSE              )                                          carSpeeds              [              ane              ,                                          ]                                                  
                          V1    V2    V3 ane Color Speed State                      

Conspicuously this is not the desired behavior for this data set, but it may exist useful if y'all have a dataset without headers.

The stringsAsFactors Argument

In older versions of R (prior to 4.0) this was perchance the almost important argument in read.csv(), especially if you were working with categorical data. This is because the default behavior of R was to convert grapheme strings into factors, which may make information technology hard to do such things equally supercede values. Information technology is important to be enlightened of this behaviour, which we will demonstrate. For example, permit'south say we observe out that the data collector was color blind, and accidentally recorded greenish cars as existence blueish. In social club to correct the data gear up, let'due south supervene upon 'Bluish' with 'Green' in the $Colour column:

                          # Here we volition use R'south `ifelse` part, in which we provide the test phrase,                                          # the effect if the outcome of the test is 'TRUE', and the outcome if the                                          # result is 'Simulated'. We volition besides assign the results to the Color cavalcade,                                          # using '<-'                                          # Kickoff - reload the data with a header                                          carSpeeds                                          <-                                          read.csv              (              file                                          =                                          'data/car-speeds.csv'              ,                                          stringsAsFactors                                          =                                          TRUE              )                                          carSpeeds              $              Colour                                          <-                                          ifelse              (              carSpeeds              $              Color                                          ==                                          'Bluish'              ,                                          'Dark-green'              ,                                          carSpeeds              $              Color              )                                          carSpeeds              $              Color                                                  
                          [1] "Green" "1"     "Green" "5"     "4"     "Green" "Green" "2"     "5"      [10] "iv"     "4"     "5"     "Green" "Light-green" "2"     "4"     "Light-green" "Light-green"  [19] "5"     "Dark-green" "Green" "Green" "4"     "Light-green" "4"     "4"     "4"      [28] "4"     "v"     "Green" "4"     "5"     "2"     "4"     "2"     "ii"      [37] "Dark-green" "4"     "2"     "4"     "ii"     "two"     "4"     "four"     "5"      [46] "2"     "Green" "4"     "4"     "2"     "2"     "4"     "5"     "4"      [55] "Greenish" "Dark-green" "2"     "Light-green" "v"     "2"     "4"     "Green" "Green"  [64] "five"     "2"     "4"     "four"     "ii"     "Green" "5"     "Light-green" "4"      [73] "5"     "5"     "Green" "Greenish" "Light-green" "Green" "Green" "5"     "2"      [82] "Light-green" "5"     "2"     "two"     "4"     "4"     "5"     "v"     "5"      [91] "five"     "four"     "4"     "4"     "five"     "ii"     "v"     "2"     "2"     [100] "5"                      

What happened?!? Information technology looks like 'Blue' was replaced with 'Green', but every other color was turned into a number (as a grapheme cord, given the quote marks before and afterwards). This is because the colors of the cars were loaded as factors, and the factor level was reported post-obit replacement.

To come across the internal construction, we can apply another role, str(). In this case, the dataframe's internal structure includes the format of each column, which is what we are interested in. str() will exist reviewed a little more in the lesson Information Types and Structures.

                          # Reload the data with a header (the previous ifelse call modifies attributes)                                          carSpeeds                                          <-                                          read.csv              (              file                                          =                                          'data/car-speeds.csv'              ,                                          stringsAsFactors                                          =                                          TRUE              )                                          str              (              carSpeeds              )                                                  
            'data.frame':	100 obs. of  3 variables:  $ Color: Factor w/ 5 levels " Red","Blackness",..: 3 1 iii five iv 3 3 2 5 four ...  $ Speed: int  32 45 35 34 25 41 34 29 31 26 ...  $ State: Factor w/ iv levels "Arizona","Colorado",..: 3 i 2 i 1 1 3 2 ane two ...                      

We can see that the $Color and $Land columns are factors and $Speed is a numeric cavalcade.

Now, let's load the dataset using stringsAsFactors=FALSE, and see what happens when nosotros try to replace 'Blue' with 'Green' in the $Color column:

                          carSpeeds                                          <-                                          read.csv              (              file                                          =                                          'information/car-speeds.csv'              ,                                          stringsAsFactors                                          =                                          Fake              )                                          str              (              carSpeeds              )                                                  
            'data.frame':	100 obs. of  3 variables:  $ Colour: chr  "Blue" " Ruby-red" "Blue" "White" ...  $ Speed: int  32 45 35 34 25 41 34 29 31 26 ...  $ Country: chr  "NewMexico" "Arizona" "Colorado" "Arizona" ...                      
                          carSpeeds              $              Color                                          <-                                          ifelse              (              carSpeeds              $              Color                                          ==                                          'Bluish'              ,                                          'Light-green'              ,                                          carSpeeds              $              Color              )                                          carSpeeds              $              Color                                                  
                          [i] "Light-green" " Red"  "Green" "White" "Cherry"   "Green" "Green" "Blackness" "White"  [ten] "Red"   "Ruby"   "White" "Greenish" "Green" "Black" "Red"   "Green" "Greenish"  [19] "White" "Greenish" "Green" "Light-green" "Scarlet"   "Green" "Red"   "Blood-red"   "Crimson"    [28] "Red"   "White" "Greenish" "Red"   "White" "Black" "Cherry"   "Black" "Blackness"  [37] "Green" "Ruby"   "Black" "Red"   "Black" "Black" "Crimson"   "Red"   "White"  [46] "Black" "Dark-green" "Red"   "Red"   "Black" "Black" "Ruby-red"   "White" "Blood-red"    [55] "Dark-green" "Green" "Black" "Green" "White" "Black" "Reddish"   "Green" "Green"  [64] "White" "Blackness" "Red"   "Red"   "Black" "Light-green" "White" "Green" "Carmine"    [73] "White" "White" "Green" "Green" "Light-green" "Green" "Dark-green" "White" "Black"  [82] "Dark-green" "White" "Black" "Black" "Carmine"   "Red"   "White" "White" "White"  [91] "White" "Blood-red"   "Cherry"   "Red"   "White" "Black" "White" "Black" "Blackness" [100] "White"                      

That's better! And we tin come across how the data now is read as graphic symbol instead of factor. From R version 4.0 onwards we do not have to specify stringsAsFactors=FALSE, this is the default behavior.

The as.is Statement

This is an extension of the stringsAsFactors statement, just gives you control over private columns. For example, if we want the colors of cars imported every bit strings, but nosotros want the names of the states imported as factors, nosotros would load the information set as:

                          carSpeeds                                          <-                                          read.csv              (              file                                          =                                          'data/car-speeds.csv'              ,                                          as.is                                          =                                          1              )                                          # Note, the ane applies as.is to the first column only                                                  

At present we can see that if we attempt to replace 'Bluish' with 'Dark-green' in the $Color column everything looks fine, while trying to replace 'Arizona' with 'Ohio' in the $Country column returns the cistron numbers for the names of states that we haven't replaced:

            'data.frame':	100 obs. of  iii variables:  $ Color: chr  "Blue" " Red" "Blue" "White" ...  $ Speed: int  32 45 35 34 25 41 34 29 31 26 ...  $ Country: Factor w/ 4 levels "Arizona","Colorado",..: 3 one 2 i i 1 3 2 1 2 ...                      
                          carSpeeds              $              Colour                                          <-                                          ifelse              (              carSpeeds              $              Color                                          ==                                          'Blue'              ,                                          'Green'              ,                                          carSpeeds              $              Colour              )                                          carSpeeds              $              Color                                                  
                          [1] "Greenish" " Crimson"  "Green" "White" "Red"   "Light-green" "Green" "Black" "White"  [x] "Cherry-red"   "Cerise"   "White" "Light-green" "Green" "Black" "Red"   "Green" "Light-green"  [xix] "White" "Dark-green" "Dark-green" "Dark-green" "Red"   "Greenish" "Ruddy"   "Scarlet"   "Red"    [28] "Ruby-red"   "White" "Dark-green" "Red"   "White" "Black" "Red"   "Blackness" "Black"  [37] "Green" "Red"   "Black" "Ruby-red"   "Blackness" "Black" "Red"   "Reddish"   "White"  [46] "Black" "Green" "Cherry"   "Cherry-red"   "Black" "Black" "Cherry-red"   "White" "Red"    [55] "Green" "Green" "Black" "Green" "White" "Blackness" "Red"   "Green" "Green"  [64] "White" "Black" "Red"   "Red"   "Blackness" "Green" "White" "Green" "Red"    [73] "White" "White" "Green" "Dark-green" "Greenish" "Dark-green" "Dark-green" "White" "Blackness"  [82] "Greenish" "White" "Black" "Black" "Red"   "Scarlet"   "White" "White" "White"  [91] "White" "Red"   "Red"   "Carmine"   "White" "Black" "White" "Black" "Blackness" [100] "White"                      
                          carSpeeds              $              State                                          <-                                          ifelse              (              carSpeeds              $              Land                                          ==                                          'Arizona'              ,                                          'Ohio'              ,                                          carSpeeds              $              State              )                                          carSpeeds              $              Country                                                  
                          [i] "3"    "Ohio" "2"    "Ohio" "Ohio" "Ohio" "three"    "2"    "Ohio" "two"     [11] "4"    "4"    "4"    "four"    "4"    "iii"    "Ohio" "iii"    "Ohio" "4"     [21] "4"    "4"    "three"    "ii"    "2"    "3"    "2"    "four"    "2"    "four"     [31] "three"    "2"    "ii"    "iv"    "2"    "2"    "3"    "Ohio" "4"    "2"     [41] "two"    "3"    "Ohio" "4"    "Ohio" "ii"    "3"    "3"    "3"    "two"     [51] "Ohio" "iv"    "4"    "Ohio" "iii"    "2"    "4"    "two"    "4"    "4"     [61] "4"    "2"    "three"    "2"    "3"    "two"    "three"    "Ohio" "three"    "4"     [71] "iv"    "2"    "Ohio" "4"    "2"    "two"    "ii"    "Ohio" "iii"    "Ohio"  [81] "4"    "2"    "ii"    "Ohio" "Ohio" "Ohio" "four"    "Ohio" "4"    "4"     [91] "4"    "Ohio" "Ohio" "3"    "two"    "2"    "four"    "iii"    "Ohio" "4"                      

We can run into that $Color cavalcade is a character while $State is a factor.

Updating Values in a Factor

Suppose we desire to go on the colors of cars as factors for another operations we want to perform. Write code for replacing 'Blueish' with 'Green' in the $Color column of the cars dataset without importing the data with stringsAsFactors=FALSE.

Solution

                                  carSpeeds                                                      <-                                                      read.csv                  (                  file                                                      =                                                      'data/car-speeds.csv'                  )                                                      # Replace 'Bluish' with 'Green' in cars$Color without using the stringsAsFactors                                                      # or as.is arguments                                                      carSpeeds                  $                  Color                                                      <-                                                      ifelse                  (                  as.character                  (                  carSpeeds                  $                  Colour                  )                                                      ==                                                      'Blue'                  ,                                                      'Green'                  ,                                                      as.graphic symbol                  (                  carSpeeds                  $                  Color                  ))                                                      # Convert colors back to factors                                                      carSpeeds                  $                  Color                                                      <-                                                      as.factor                  (                  carSpeeds                  $                  Color                  )                                                                  

The strip.white Argument

Information technology is not uncommon for mistakes to have been fabricated when the data were recorded, for example a space (whitespace) may accept been inserted before a information value. Past default this whitespace will be kept in the R environs, such that '\ Red' volition be recognized equally a different value than 'Red'. In lodge to avoid this blazon of error, use the strip.white argument. Allow's meet how this works by checking for the unique values in the $Color cavalcade of our dataset:

Here, the data recorder added a space before the color of the car in ane of the cells:

                          # Nosotros utilize the built-in unique() function to extract the unique colors in our dataset                                          unique              (              carSpeeds              $              Colour              )                                                  
            [1] Dark-green  Reddish  White Carmine   Black Levels:  Red Blackness Light-green Crimson White                      

Oops, we see ii values for ruby-red cars.

Let'southward effort once again, this time importing the data using the strip.white statement. NOTE - this argument must be accompanied by the sep statement, by which we point the type of delimiter in the file (the comma for most .csv files)

                          carSpeeds                                          <-                                          read.csv              (                                          file                                          =                                          'data/machine-speeds.csv'              ,                                          stringsAsFactors                                          =                                          FALSE              ,                                          strip.white                                          =                                          TRUE              ,                                          sep                                          =                                          ','                                          )                                          unique              (              carSpeeds              $              Color              )                                                  
            [1] "Blueish"  "Red"   "White" "Black"                      

That's meliorate!

Specify Missing Data When Loading

It is common for data sets to take missing values, or mistakes. The convention for recording missing values ofttimes depends on the private who collected the data and can be recorded as n.a., --, or empty cells " ". R recognises the reserved character string NA as a missing value, just not some of the examples above. Let'southward say the inflamation scale in the data set we used earlier inflammation-01.csv actually starts at 1 for no inflamation and the zero values (0) were a missed ascertainment. Looking at the ?read.csv help page is in that location an statement nosotros could employ to ensure all zeros (0) are read in as NA? Possibly, in the auto-speeds.csv data contains mistakes and the person measuring the motorcar speeds could not accurately distinguish betwixt "Black or "Blue" cars. Is in that location a way to specify more than one 'string', such as "Blackness" and "Blue", to be replaced by NA

Solution

                                  read.csv                  (                  file                                                      =                                                      "data/inflammation-01.csv"                  ,                                                      na.strings                                                      =                                                      "0"                  )                                                                  

or , in car-speeds.csv utilize a character vector for multiple values.

                                  read.csv                  (                                                      file                                                      =                                                      'data/motorcar-speeds.csv'                  ,                                                      na.strings                                                      =                                                      c                  (                  "Black"                  ,                                                      "Blue"                  )                                                      )                                                                  

Write a New .csv and Explore the Arguments

Subsequently altering our cars dataset past replacing 'Blue' with 'Green' in the $Color column, we now desire to save the output. There are several arguments for the write.csv(...) function call, a few of which are particularly important for how the information are exported. Let'south explore these now.

                          # Consign the data. The write.csv() function requires a minimum of two                                          # arguments, the data to be saved and the name of the output file.                                          write.csv              (              carSpeeds              ,                                          file                                          =                                          'data/car-speeds-cleaned.csv'              )                                                  

If you open the file, yous'll see that it has header names, because the data had headers within R, simply that there are numbers in the first column.

csv written without row.names argument

The row.names Statement

This argument allows united states of america to set the names of the rows in the output information file. R's default for this argument is TRUE, and since it does not know what else to name the rows for the cars data set up, it resorts to using row numbers. To right this, we can fix row.names to FALSE:

                          write.csv              (              carSpeeds              ,                                          file                                          =                                          'information/car-speeds-cleaned.csv'              ,                                          row.names                                          =                                          FALSE              )                                                  

At present we come across:

csv written with row.names argument

Setting Column Names

There is also a col.names argument, which can exist used to ready the column names for a data set without headers. If the information gear up already has headers (east.g., we used the headers = True statement when importing the data) and so a col.names argument will be ignored.

The na Argument

There are times when we want to specify sure values for NAs in the information gear up (east.thousand., we are going to laissez passer the data to a program that only accepts -9999 equally a nodata value). In this case, we desire to set up the NA value of our output file to the desired value, using the na argument. Let's see how this works:

                          # First, replace the speed in the 3rd row with NA, by using an alphabetize (square                                          # brackets to indicate the position of the value we want to replace)                                          carSpeeds              $              Speed              [              3              ]                                          <-                                          NA                                          head              (              carSpeeds              )                                                  
                          Color Speed     Land 1  Blue    32 NewMexico ii   Red    45   Arizona three  Blue    NA  Colorado iv White    34   Arizona 5   Scarlet    25   Arizona half-dozen  Blue    41   Arizona                      
                          write.csv              (              carSpeeds              ,                                          file                                          =                                          'data/car-speeds-cleaned.csv'              ,                                          row.names                                          =                                          FALSE              )                                                  

At present we'll set NA to -9999 when we write the new .csv file:

                          # Notation - the na argument requires a cord input                                          write.csv              (              carSpeeds              ,                                          file                                          =                                          'data/auto-speeds-cleaned.csv'              ,                                          row.names                                          =                                          Fake              ,                                          na                                          =                                          '-9999'              )                                                  

And we see:

csv written with -9999 as NA

Cardinal Points

  • Import data from a .csv file using the read.csv(...) function.

  • Understand some of the cardinal arguments available for importing the data properly, including header, stringsAsFactors, as.is, and strip.white.

  • Write information to a new .csv file using the write.csv(...) function

  • Understand some of the key arguments available for exporting the data properly, such as row.names, col.names, and na.

thomasdombant.blogspot.com

Source: https://swcarpentry.github.io/r-novice-inflammation/11-supp-read-write-csv/

Post a Comment for "R Not Reading First Row of Csv"