Statistical analysis with R
NOTE: same tutorial as in the repository ORCRI –>(Online Ressources for Crime Reporting and Investigation).
Credits: notes from the course Introduction à la statistique avec R, Paris Saclay
Statistics beginners course
- Review some notions: https://www.khanacademy.org/math/statistics-probability/analyzing-categorical-data
Books
- RStudio : https://rstudio-education.github.io/hopr/starting.html
- R and data sciences: https://r4ds.had.co.nz/
Data
- For this tutorial here
Fancy
- R Markdown to produce beautiful reports: https://shiny.rstudio.com/articles/rmarkdown.html
Download and installation (Ubuntu 20)
Download and install the last release of R from the official website : https://cran.r-project.org/
There is a apt for it:
sudo apt install r-base
There also a links for precompiled binary versions, available for Windows, Mac and Linux.
Rstudio
As Rstudio for Ubuntu 20 isn’t available, download the one for Ubuntu18 Jessie.
dpkg -i rstudio-1.3.1073-amd64.deb
You might need to run another command to fix dependancies issues.
sudo apt --fix-broken install
Launch R interpreter, run Rstudio
Start Rstudio from the launchpad of Ubuntu
or/and
$R runs the R interpreter in your terminal
If you need a graphic interface, dl Rstudio https://rstudio.com/products/rstudio/download/
Run RStudio
First install packages you’ll need to draw plots:
Tools -> Install Packages... -> and select package you want to install in the dialog box; here "gplots"
Starting key points
- In addition to the visual interface, you can launch instructions directly in the R interpretor (the console) and see the output (crucial for errors)
- Graphs are generated in another window.
- R scripts have the extention .r
-
you hit a script by running the instruction source()
- You call the help menu by running the instruction help().
Variables
- variables names are Case-senSitiVe
- Missing value is encoded such as NA
- mode() display the variable type numeric or character and lenght() the number of item in the variable
- factor() and levels() are helps to qualify a variable: factor() for what qualitative value of the variable, labels= for labelling the variable
to keep in mind : Variable = (element1, element2, element2, element3,…) it’s like supervariable= [] and a lot of things inthere.
Load data
tips: Cat your file before to see what is the separator used in your data set. For example with a .csv file a “,” or a “;” as separator differs the input syntax: read.csv for “,” and read.csv2 for “;”
> smp.c <-read.csv2[file path]
# smp here is the file in which I lead a set of data with "<-" and the .c method and read from a file.
So with our data: > smp.c <-read.csv2(“data/smp1.csv”)
To check the data integrity :
> str(smp.c) ~~~R 'data.frame': 799 obs. of 9 variables: $ age : int 31 49 50 47 23 34 24 52 42 45 ... $ prof : Factor w/ 8 levels "agriculteur",..: 3 NA 7 6 8 6 3 2 6 6 ... $ dep.cons : int 0 0 0 0 1 0 1 0 1 0 ... $ scz.cons : int 0 0 0 0 0 0 0 0 0 0 ... $ grav.cons: int 1 2 2 1 2 1 5 1 5 5 ... $ n.enfant : int 2 7 2 0 1 3 5 2 1 2 ... $ rs : int 2 2 2 2 2 1 3 2 3 2 ... $ ed : int 1 2 3 2 2 2 3 2 3 2 ... $ dr : int 1 1 2 2 2 1 2 2 1 2 ... ~~~
Visualizations
barplot
> barplot(table(smp.c$prof))
The same thing as table in the console:
> table(smp.c$prof) ~~~R
agriculteur artisan autre cadre employe ouvrier prof.intermediaire
6 90 31 24 135 227 58
sans emploi
222 ~~~
### pie
> pie(table(smp.c$prof))
histogram
> hist(smp.c$age) ![img](/pictures/hist.png)
vocabulary:
http://vita.had.co.nz/papers/layered-grammar.pdf