Statistical analysis with R


NOTE: this is the same tutorial as in the ORCRI repository (Online Ressources for Crime Reporting and Investigation).

Credits: notes from the course Introduction à la statistique avec R, Paris Saclay

Statistics course for beginners

  • Review some notions: https://www.khanacademy.org/math/statistics-probability/analyzing-categorical-data

Books

  • Hands-On Programming with R (starting with R and RStudio): https://rstudio-education.github.io/hopr/starting.html
  • R for Data Science: https://r4ds.had.co.nz/

Data

  • For this tutorial here

Fancy

  • R Markdown to produce beautiful reports: https://shiny.rstudio.com/articles/rmarkdown.html

Download and installation (Ubuntu 20)

Download and install the latest release of R from the official website: https://cran.r-project.org/

There is an apt package for it:

sudo apt install r-base

The website also links to precompiled binaries for Windows, macOS and Linux.

RStudio

As RStudio for Ubuntu 20 isn’t available, download the build for Ubuntu 18 (Bionic).

sudo dpkg -i rstudio-1.3.1073-amd64.deb

You might need to run another command to fix dependency issues.

sudo apt --fix-broken install

Launch the R interpreter, run RStudio

Start RStudio from the Ubuntu launcher

and/or

Typing R in your terminal starts the R interpreter.

If you need a graphical interface, download RStudio: https://rstudio.com/products/rstudio/download/

Run RStudio

First install the packages you’ll need to draw plots:

    Tools -> Install Packages... -> select the package you want to install in the dialog box; here "gplots"
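The same installation can also be done from the console. Below is a minimal sketch, assuming an internet connection and a reachable CRAN mirror. The helper name `missing_packages` is hypothetical and only serves the illustration.

```r
# Return the packages from `pkgs` that are not installed yet.
# (`missing_packages` is a hypothetical helper, not part of R.)
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("gplots"))

# Install only what is missing. Guarded with interactive() so that
# nothing is downloaded when the script runs unattended.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

Once the package is installed, `library(gplots)` loads it for the current session.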

Starting key points

  • In addition to the visual interface, you can run instructions directly in the R interpreter (the console) and see the output (crucial for spotting errors).
  • Graphs are generated in another window.
  • R scripts have the extension .r (commonly written .R).
  • You run a script with the instruction source().
  • You open the help with the instruction help().
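As an illustration of source() and help(), here is a small sketch. The script content and the variable name `answer` are made up for the example, and the script is written to a temporary file so the sketch is self-contained.

```r
# Write a tiny script to a temporary file (hypothetical content).
script_path <- tempfile(fileext = ".R")
writeLines("answer <- 6 * 7", script_path)

# source() runs every instruction of the file, as if typed in the console.
source(script_path)
print(answer)

# help(mean), or the shortcut ?mean, opens the documentation page of mean().
# Guarded so that no help pager is opened when running unattended.
if (interactive()) help(mean)
```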

Variables

  • Variable names are case-sensitive.
  • A missing value is encoded as NA.
  • mode() displays the variable type (numeric or character) and length() the number of items in the variable.
  • factor() and levels() help describe a qualitative variable: factor() declares the variable as qualitative, its labels= argument labels the categories, and levels() lists them.

Keep in mind: a variable can hold many elements, variable = (element1, element2, element3, …). It is like a container [] with a lot of things in it. In R this is a vector, built with c(element1, element2, element3, …).
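The points above can be tried directly in the console. A short sketch, with made-up values and variable names (`age`, `sex_code`, `sex`) that are not part of the tutorial's data set:

```r
# A variable holding several elements is a vector, built with c().
age <- c(31, 49, NA, 47)    # hypothetical values, one of them missing

mode(age)                   # "numeric"
length(age)                 # 4: the NA counts as an element
is.na(age)                  # TRUE where the value is missing
mean(age, na.rm = TRUE)     # na.rm = TRUE ignores the missing value

# factor() turns numeric codes into a qualitative variable;
# labels= gives a readable name to each category.
sex_code <- c(1, 2, 2, 1)
sex <- factor(sex_code, levels = c(1, 2), labels = c("female", "male"))
levels(sex)                 # the categories of the factor
table(sex)                  # number of observations per category
```

Without `na.rm = TRUE`, `mean(age)` returns NA, because one missing element makes the mean undefined.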

Load data

Tip: inspect your file first (for example with cat) to see which separator your data set uses. For a .csv file, the separator determines the function to call: read.csv() for “,” and read.csv2() for “;”.

    > smp.c <- read.csv2("<file path>")
    # smp.c is the variable (a data frame) that receives the data read from the file.
    # "<-" is the assignment operator; the ".c" is simply part of the name, not a method.

So with our data:

    > smp.c <- read.csv2("data/smp1.csv")
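To see the effect of the separator without the tutorial's file, the sketch below writes a small semicolon-separated file and reads it back. The file content and the names `csv_path` and `toy` are invented for the illustration.

```r
# Write a small semicolon-separated file (hypothetical sample data).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("age;prof;n.enfant",
             "31;autre;2",
             "49;;7",
             "50;ouvrier;2"), csv_path)

# Peek at the first lines to identify the separator (like cat or head in a shell).
readLines(csv_path, n = 2)

# ";" as separator: use read.csv2(). With "," you would call read.csv() instead.
# na.strings also turns empty fields into NA.
toy <- read.csv2(csv_path, na.strings = c("NA", ""))
toy
```

Reading this file with read.csv() would not fail, but it would produce a single column, which is a quick sign that the wrong separator was assumed.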

To check the data integrity :

    > str(smp.c)
~~~R
'data.frame':	799 obs. of  9 variables:
 $ age      : int  31 49 50 47 23 34 24 52 42 45 ...
 $ prof     : Factor w/ 8 levels "agriculteur",..: 3 NA 7 6 8 6 3 2 6 6 ...
 $ dep.cons : int  0 0 0 0 1 0 1 0 1 0 ...
 $ scz.cons : int  0 0 0 0 0 0 0 0 0 0 ...
 $ grav.cons: int  1 2 2 1 2 1 5 1 5 5 ...
 $ n.enfant : int  2 7 2 0 1 3 5 2 1 2 ...
 $ rs       : int  2 2 2 2 2 1 3 2 3 2 ...
 $ ed       : int  1 2 3 2 2 2 3 2 3 2 ...
 $ dr       : int  1 1 2 2 2 1 2 2 1 2 ...
~~~
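The output above shows prof as a Factor. Since R 4.0.0, read.csv2() reads text columns as character by default, so reproducing that output on a recent version may require stringsAsFactors = TRUE. The sketch below uses a small made-up data set (not the tutorial's file) and also counts the missing values per column, which is a common extra integrity check.

```r
# Hypothetical miniature of the data set, passed as text instead of a file.
csv_text <- "age;prof\n31;autre\n49;NA\n50;ouvrier"
toy <- read.csv2(text = csv_text, stringsAsFactors = TRUE)

str(toy)               # prof is reported as a Factor
colSums(is.na(toy))    # number of missing values in each column
summary(toy$age)       # minimum, quartiles, mean and maximum
```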

Visualizations

barplot

    > barplot(table(smp.c$prof))

The same information as a table in the console:

    > table(smp.c$prof)
~~~R
       agriculteur            artisan              autre              cadre            employe            ouvrier prof.intermediaire 
                 6                 90                 31                 24                135                227                 58 
       sans emploi 
               222 
~~~
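Note that table() leaves out missing values by default, which is why the counts above can add up to fewer than the number of observations. A sketch on made-up values (the vector `prof` below is hypothetical) showing how to count NA as well, sort the categories and turn counts into proportions:

```r
# Hypothetical values, including one missing value.
prof <- c("ouvrier", "cadre", NA, "ouvrier", "artisan", "ouvrier")

counts <- table(prof)                        # NA is ignored
counts_na <- table(prof, useNA = "ifany")    # NA gets its own cell

sorted <- sort(counts, decreasing = TRUE)    # most frequent category first
sorted
round(prop.table(counts), 2)                 # share of each category
```

The sorted table can be passed to barplot() in the same way as the unsorted one, for example `barplot(sorted)`.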

pie

    > pie(table(smp.c$prof))

histogram

    > hist(smp.c$age)

![img](/pictures/hist.png)
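hist() accepts optional arguments for the number of bins, the title and the axis labels, and it returns the computed bins. The sketch below uses simulated ages (not the tutorial's data) and draws into a PDF file, so it also works where no graphics window is available. The names `age`, `pdf_path` and `h` are chosen for the example.

```r
set.seed(1)
age <- round(rnorm(200, mean = 39, sd = 13))   # simulated, hypothetical ages

# Open a file device instead of a window, draw, then close the device.
pdf_path <- tempfile(fileext = ".pdf")
pdf(pdf_path)
h <- hist(age, breaks = 10, col = "grey",
          main = "Age distribution", xlab = "Age (years)")
dev.off()

h$breaks   # limits of the bins (breaks = 10 is only a suggestion to R)
h$counts   # number of observations in each bin
```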

vocabulary:

http://vita.had.co.nz/papers/layered-grammar.pdf