Apr 14, 2009

R Note

An Introduction to R
A very good introduction.


R or Matlab? Roughly speaking, Matlab is about twice as fast.

To set the language of the Windows Rgui, edit the Rconsole file in the etc folder of the installation directory and set the language option to en.

Getting help:
help.start() 

help(solve) #equivalent to 
?solve

help.search("solve") #the pattern must be a character string; equivalent to
??solve

example(topic) #example of some topics

Use '#' to start a comment.

R does not distinguish between row and column vectors

Commands are separated either by a semi-colon (‘;’), or by a newline. Elementary commands can be grouped together into one compound expression by braces (‘{’ and ‘}’).

Running a script in R:
> source("commands.R")
> sink("record.lis") #redirect output to a file
> sink() #send output back to the console
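
As a small illustration of source() and sink() together, the sketch below writes a one-line script to a temporary file, runs it, and captures printed output in a second temporary file. The file names and the variable answer are made up for the example.

```r
# Write a tiny script to a temporary file and run it with source().
script_file <- tempfile(fileext = ".R")      # hypothetical script file
writeLines("answer <- 6 * 7", script_file)
source(script_file)                          # defines 'answer'

# Redirect printed output to a file, then restore the console.
out_file <- tempfile(fileext = ".lis")       # hypothetical output file
sink(out_file)                               # output now goes to out_file
print(answer)
sink()                                       # output goes back to the console

captured <- readLines(out_file)              # what was written while sink was active
print(captured)
```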

Initialization: create a .Rprofile file in the default working directory to run startup code. If you do not know the default working directory, run getwd().

workspace, objects:
> objects() #equivalent to ls()
> ls() #list all objects in workspace
> rm(x,y) #remove x and y from workspace

rm(list=ls())
removes all objects listed by ls() from the workspace.

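R can optionally save the objects and the command history to the current directory (.RData and .Rhistory); the next time R is started from the same directory, they are loaded again. It is therefore best to keep different projects in different directories.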

Working with directories:
oldwd <- getwd()
setwd("E:/bugs/ex2")

setwd(oldwd)

R is case sensitive.

Special values:
NaN (not a number), NA (not available), Inf
is.na() detects both NA and NaN
is.nan() detects only NaN
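
A quick check of the difference, on a made-up vector that contains each special value:

```r
v <- c(1, NA, NaN, Inf)

na_flags     <- is.na(v)      # FALSE  TRUE  TRUE FALSE  (NA and NaN)
nan_flags    <- is.nan(v)     # FALSE FALSE  TRUE FALSE  (NaN only)
finite_flags <- is.finite(v)  #  TRUE FALSE FALSE FALSE  (excludes NA, NaN, Inf)

print(rbind(na_flags, nan_flags, finite_flags))
```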

Inspecting an object:
summary()
mode() #get or set the mode (basic type)
dim() #dimensions
names() #element or column names
colnames()
str() #Compactly Display the Structure of an Arbitrary R Object

Package:
> library() # To see which packages are installed at your site

> library(package) # To load a particular package

Installing packages:
  1. Users connected to the Internet can use the install.packages() and update.packages() functions (available through the Packages menu in the Windows and RAqua GUIs, see Installing packages) to install and update packages. Calling install.packages() with no arguments lets you pick a mirror and a package interactively, similar to yum on Linux.
  2. To install a downloaded package (a file ending in tar.gz), use the shell command (in bash or at the DOS prompt) R CMD INSTALL [-l lib] pkg, e.g. C:\Program Files\R\R-2.8.1\bin>R CMD INSTALL "C:\Documents and Settings\kevin\Desktop\download\lda_1.0.1.tar.gz" [1]
[1] On Windows, because of how the Cygwin build works, an absolute path cannot be used: put the tar.gz in the bin directory and drop the absolute path. You also need to install HTML Help Workshop, which is used to generate the help files.

To see which packages are currently loaded, use

> search()

to display the search list. Some packages may be loaded but not available on the search list (see Namespaces): these will be included in the list given by

> loadedNamespaces()

Namespace:
There are two operators that work with namespaces. The double-colon operator :: selects definitions from a particular namespace. For example, the transpose function will always be available as base::t, because it is defined in the base package. Only functions that are exported from the package can be retrieved in this way.

The triple-colon operator ::: may be seen in a few places in R code: it acts like the double-colon operator but also allows access to hidden objects. Users are more likely to use the getAnywhere() function, which searches multiple packages.
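
A minimal sketch of why the double-colon form is useful: a user-defined function named t (made up here purely for illustration) masks the transpose on the search path, but base::t still reaches the original.

```r
m <- matrix(1:6, nrow = 2)

# A user-defined t() now masks base::t on the search path (illustrative only).
t <- function(x) "masked"
masked_result <- t(m)        # calls the user-defined function

# The namespace-qualified name still refers to the real transpose.
tm <- base::t(m)
print(dim(tm))               # 3 2

rm(t)                        # remove the masking definition again
```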

Generic function:
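Generic functions let one function have different implementations (methods) for different classes. A method is named function.class, for example predict.glmnet. When predict is called with an object argument of class glmnet, predict.glmnet is used automatically.
methods(class="A") lists all methods available for class A
methods(func) lists the classes for which the generic function func has methods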
> methods(class="glmnet")
[1] coef.glmnet    plot.glmnet    predict.glmnet print.glmnet  
> methods(predict)
 [1] predict.ar*                predict.Arima*            
 [3] predict.arima0*            predict.elnet             
 [5] predict.glm                predict.glmnet            
 [7] predict.HoltWinters*       predict.lm                
 [9] predict.loess*             predict.lognet            
[11] predict.mlm                predict.multnet           
[13] predict.nls*               predict.poly              
[15] predict.ppr*               predict.prcomp*           
[17] predict.princomp*          predict.smooth.spline*    
[19] predict.smooth.spline.fit* predict.StructTS*         

   Non-visible functions are asterisked
> 
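
To make the function.class naming concrete, here is a minimal sketch of a generic with two methods. The generic describe and the class circle are invented for this example and do not come from any package.

```r
# A hypothetical generic: UseMethod() dispatches on the class of 'object'.
describe <- function(object, ...) UseMethod("describe")

# Fallback used when no class-specific method exists.
describe.default <- function(object, ...) "something else"

# Method for the made-up class "circle", named generic.class.
describe.circle <- function(object, ...) paste("circle with radius", object$r)

shape <- structure(list(r = 2), class = "circle")

res_circle  <- describe(shape)   # dispatches to describe.circle
res_default <- describe(42)      # falls back to describe.default
print(res_circle)
print(res_default)
```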

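Strings:
Either single or double quotes can be used. C-style escapes ('\') are supported.

Logical type and operations: the logical values are TRUE and FALSE, in all capitals.
>, >=, <, <=, ==, != produce logical values; TRUE and FALSE can also be assigned directly.
Logical operators: &, |, !

Logical values can be used as numbers in arithmetic: FALSE is 0 and TRUE is 1.
Logical vectors can be used for indexing, as in Matlab.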
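
A short sketch of these points on a made-up vector: a comparison produces a logical vector, which can be summed like numbers or used as an index.

```r
x <- c(5, -2, 8, -1)

pos <- x > 0                 # TRUE FALSE TRUE FALSE
n_pos    <- sum(pos)         # TRUE counts as 1, so this is the number of positives
frac_pos <- mean(pos)        # proportion of positives

positives <- x[pos]              # logical indexing keeps elements where pos is TRUE
small_pos <- x[pos & x < 6]      # combine conditions with &

print(list(n_pos, frac_pos, positives, small_pos))
```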

Vector indexing:
  • logical vector
  • positive numbers
  • negative numbers: a negative index means "exclude"; -1 drops the first element
  • names
> x[is.na(x)] <- 0 # set missing values to 0

     > y[y < 0] <- -y[y < 0] # equivalent to

     > y <- abs(y)
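
The four kinds of index can be tried side by side on a small named vector (the values and names are arbitrary):

```r
x <- c(a = 10, b = 20, c = 30, d = 40)

by_logical  <- x[x > 15]         # elements b, c, d
by_position <- x[c(1, 3)]        # elements a, c
by_negative <- x[-1]             # everything except the first element
by_name     <- x[c("a", "d")]    # elements a, d

# Replacing through a logical index, as in the y example above.
y <- c(-3, 2, -1)
y[y < 0] <- -y[y < 0]            # same result as abs()

print(by_logical); print(by_position); print(by_negative); print(by_name); print(y)
```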


types of object:
R's basic types are called modes: numeric, complex, logical, character and raw. Everything else is presumably a more structured object:
  • Vectors are the most important type of object in R. All elements of a vector must have the same mode.
  • matrices or more generally arrays are multi-dimensional generalizations of vectors. In fact, they are vectors that can be indexed by two or more indices and will be printed in special ways. See Arrays and matrices.
  • factors provide compact ways to handle categorical data. See Factors.
  • lists are a general form of vector in which the various elements need not be of the same type, and are often themselves vectors or lists. Lists provide a convenient way to return the results of a statistical computation. See Lists.
  • data frames are matrix-like structures, in which the columns can be of different types. Think of data frames as `data matrices' with one row per observational unit but with (possibly) both numerical and categorical variables. Many experiments are best described by data frames: the treatments are categorical but the response is numeric. See Data frames.
  • functions are themselves objects in R which can be stored in the project's workspace. This provides a simple and convenient way to extend R
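class() returns the class of an object; for a plain vector with no class attribute it returns an implicit class, which is usually its mode (integer vectors report "integer").
Each of these types also has corresponding is.type() and as.type() functions for testing and conversion.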
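
A brief sketch comparing mode() and class() on a few of the object types listed above, plus one is.type()/as.type() pair. The example values are arbitrary.

```r
v <- 1:3
v_mode  <- mode(v)               # "numeric"
v_class <- class(v)              # "integer" (implicit class of an integer vector)

f <- factor(c("lo", "hi", "lo"))
f_mode  <- mode(f)               # "numeric": factors are stored as integer codes
f_class <- class(f)              # "factor"

df <- data.frame(id = 1:2, grp = c("a", "b"))
df_mode  <- mode(df)             # "list": a data frame is a list of columns
df_class <- class(df)            # "data.frame"

# Testing and converting with is.type() / as.type().
was_numeric <- is.numeric(v)     # TRUE
s <- as.character(v)             # "1" "2" "3"

print(c(v_mode, v_class, f_mode, f_class, df_mode, df_class))
```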

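Assignment:
R's assignment syntax is fairly flexible: you can assign to the result of a function call. (It looks like a C++ function returning a reference, but R actually evaluates it by calling the replacement function `length<-`.)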
> x = 1:10
> length(x)
[1] 10
> length(x) <- 3
> x
[1] 1 2 3
> 
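
The sketch below spells out what happens behind length(x) <- 3 and defines a replacement function of its own. The name `second<-` is hypothetical and exists only for this example.

```r
x <- 1:10
length(x) <- 3                   # shorthand for: x <- `length<-`(x, 3)

# Calling the replacement function directly gives the same result.
x2 <- `length<-`(1:10, 3)

# A hypothetical user-defined replacement function: its name ends in "<-",
# its last argument is called 'value', and it returns the modified object.
`second<-` <- function(x, value) {
  x[2] <- value
  x
}

v <- c(1, 2, 3)
second(v) <- 99                  # evaluated as: v <- `second<-`(v, value = 99)
print(v)
```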

Array&Matrix:
Z <- array(data_vector, dim_vector)
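
As a concrete instance of array(data_vector, dim_vector), the following fills a 2 x 3 x 4 array from 1:24. The data are stored in column-major order, so the first index varies fastest.

```r
Z <- array(1:24, dim = c(2, 3, 4))

z_dim <- dim(Z)          # 2 3 4
first <- Z[2, 1, 1]      # 2: the first index varies fastest
next_col <- Z[1, 2, 1]   # 3: then the second index
last  <- Z[2, 3, 4]      # 24: the final element

# A two-dimensional array is a matrix.
M <- array(1:6, dim = c(2, 3))
print(is.matrix(M))
```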

Function on the fly and function as object
> f <- function(x, y) cos(y)/(1 + x^2)
> x <- seq(-1,1,by=0.1)
> y <- seq(-1,1,by=0.1)
> z <- outer(x,y,f)
>

Matrix/vector operators are written as %operator%;
for example, %*% is matrix multiplication.
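
A small sketch contrasting the element-wise * with %*%, along with two other built-in %operator% forms (%o% for the outer product and %in% for membership). The matrix and vector are arbitrary.

```r
A <- matrix(1:4, nrow = 2)   # rows are (1, 3) and (2, 4)
b <- c(1, 1)

elementwise <- A * A         # multiplies matching entries
product     <- A %*% A       # matrix product
Ab          <- A %*% b       # matrix times vector, a 2 x 1 matrix
outer_bb    <- b %o% b       # outer product, same as outer(b, b)
found       <- 5 %in% 1:10   # membership test

print(product)
print(Ab)
```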

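Reading and writing data: R basically follows the convention "rows are records, columns are features".
R is strict about the format of data files; the idea is that you preprocess the data with another tool, such as Python, so that it meets the expected format.
When the file has both a header line and row labels (so the header has one fewer field than the data rows), or has neither, use
x <- read.table("file")
When it has only a header line, use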
x <- read.table("file", header=TRUE)

write.table(x,"file")
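
A round trip through temporary files shows both cases. The data frame and its column names are invented for the example.

```r
df <- data.frame(height = c(1.5, 1.8), weight = c(50, 70),
                 row.names = c("a", "b"))

# Case 1: header line plus row labels. write.table() writes both by default,
# and the header then has one fewer field than the data rows.
f1 <- tempfile()
write.table(df, f1)
back1 <- read.table(f1)                    # header and row names detected automatically

# Case 2: header line only, so header = TRUE must be given explicitly.
f2 <- tempfile()
write.table(df, f2, row.names = FALSE)
back2 <- read.table(f2, header = TRUE)

print(back1)
print(back2)
```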

numeric(0) creates a numeric vector of length 0.

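scan() reads data sequentially in a specified format; for example, the format list(0,0,0,"") means three numbers followed by a string.
List elements are indexed with [[i]]. If there is only one field type, the data is read as a single vector; the following code reads it into a matrix:
z<-matrix(scan("x.txt",0),ncol=5,byrow=TRUE)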
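
Both uses of scan() can be tried on small temporary files. The file contents below are made up, and quiet = TRUE only suppresses the "Read N items" message.

```r
# One field type: scan() returns a single vector, reshaped into a matrix.
data_file <- tempfile(fileext = ".txt")
writeLines(c("1 2 3 4 5", "6 7 8 9 10"), data_file)
z <- matrix(scan(data_file, 0, quiet = TRUE), ncol = 5, byrow = TRUE)

# Mixed field types: scan() returns a list with one component per field.
rec_file <- tempfile(fileext = ".txt")
writeLines(c("1 2 3 a", "4 5 6 b"), rec_file)
rec <- scan(rec_file, list(0, 0, 0, ""), quiet = TRUE)

print(z)
print(rec[[4]])          # the string field, accessed with [[i]]
```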

Probability distributions
Distribution         R name     additional arguments
beta                 beta       shape1, shape2, ncp
binomial             binom      size, prob
Cauchy               cauchy     location, scale
chi-squared          chisq      df, ncp
exponential          exp        rate
F                    f          df1, df2, ncp
gamma                gamma      shape, scale
geometric            geom       prob
hypergeometric       hyper      m, n, k
log-normal           lnorm      meanlog, sdlog
logistic             logis      location, scale
negative binomial    nbinom     size, prob
normal               norm       mean, sd
Poisson              pois       lambda
Student's t          t          df, ncp
uniform              unif       min, max
Weibull              weibull    shape, scale
Wilcoxon             wilcox     m, n

Prefix the name given here by `d' for the density, `p' for the CDF, `q' for the quantile function and `r' for simulation (random deviates).
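
For instance, with the normal distribution (R name norm) the four prefixes give dnorm, pnorm, qnorm and rnorm. The seed and the parameter values below are arbitrary.

```r
dens <- dnorm(0)                  # density of N(0, 1) at 0, about 0.3989
cdf  <- pnorm(1.96)               # P(X <= 1.96), about 0.975
quan <- qnorm(0.975)              # inverse of the CDF, about 1.96

set.seed(1)                       # arbitrary seed, for reproducibility only
draws <- rnorm(5, mean = 10, sd = 2)   # five random deviates

# The same scheme applies to the other rows, e.g. the binomial.
pb <- dbinom(2, size = 4, prob = 0.5)  # P(exactly 2 successes in 4 trials)

print(c(dens, cdf, quan, pb))
print(draws)
```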

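Matrix Indexing: note that ':' has higher precedence than binary arithmetic operators such as '+', so do not forget parentheses when computing the start or end of an index range.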
> idx = c(1,2,3,4,5)
> idx[2:3]
[1] 2 3
> idx[2+1:3]
[1] 3 4 5
> idx[(2+1):3]
[1] 3

sample(x, size, replace = FALSE, prob = NULL)
x is a vector or a positive integer N; size defaults to the length of the vector, or to N.
So by default sample returns a random permutation of the vector, or of 1:N.
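
A few calls illustrating these defaults. The seed is arbitrary, so the particular values drawn are not shown.

```r
set.seed(42)                          # arbitrary seed, for reproducibility only

p <- sample(5)                        # a random permutation of 1:5
q <- sample(c("a", "b", "c"))         # a random permutation of the vector
s <- sample(10, 3)                    # 3 distinct values from 1:10
coin <- sample(c("H", "T"), 10, replace = TRUE, prob = c(0.7, 0.3))  # weighted, with replacement

print(p); print(q); print(s); print(coin)
```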

Writing a matrix:
write.matrix(x,file='') # this is in package MASS, so load MASS first
For a sparse matrix, use
writeMM(x,'file') # from package Matrix

2 comments:

Yifan Yang said...

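The author of the R vs. Matlab comparison at http://mlg.eng.cam.ac.uk/dave/rmbenchmark.php is clearly not very good at writing R code. For example, for the test
Generating 10^7 uniform(0,1) random numbers
even a straight copy-and-paste gives me timings (in seconds) of
user system elapsed
0.02 0.08 0.09
yet the author somehow got 1.136 s... I suspect the author used a loop instead of vectorized code.

As for tests such as sorting, the author probably did not use the quicksort option that sort provides... which is presumably Matlab's default...

Still, in many cases R really is slower than Matlab. For matrix operations, although both are built on a CBLAS library, different settings give different results. (R lets you choose a specific BLAS library.)
Also, R does not have to be run under Cygwin. Some libraries are not very reliable under Cygwin, and speed is affected too.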

dhp said...

You make a good point; I will run the tests myself when I have time.