id name salary start_date dept 1 1 Rick 623.302012-01-01 IT 2 2 Dan 515.202013-09-23 Operations 3 3 Michelle 611.002014-11-15 IT 4 4 Ryan 729.002014-05-11 HR 5 NA Gary 843.252015-03-27 Finance 6 6 Nina 578.002013-05-21 IT 7 7 Simon 632.802013-07-30 Operations 8 8 Guru 722.502014-06-17 Finance
id name salary start_date dept 3 3 Michelle 611.002014-11-15 IT 4 4 Ryan 729.002014-05-11 HR 5 NA Gary 843.252015-03-27 Finance 8 8 Guru 722.502014-06-17 Finance
写入 CSV 文件
R 语言可以创建 CSV 文件形式的现有数据框。write.csv() 函数用于创建 CSV 文件。此文件在工作目录中创建。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
# Create a data frame. data <- read.csv("input.csv") retval <- subset(data, as.Date(start_date) > as.Date("2014-01-01"))
# Write filtered data into a new file. write.csv(retval, "output.csv") newdata <- read.csv("output.csv") print(newdata)
# 当我们执行上面的代码,它产生以下结果:
X id name salary start_date dept 1 33 Michelle 611.002014-11-15 IT 2 44 Ryan 729.002014-05-11 HR 3 5NA Gary 843.252015-03-27 Finance 4 88 Guru 722.502014-06-17 Finance
这里列 X 来自数据集 newper。这可以在写入文件时使用附加参数 row.names = FALSE 删除。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
# Create a data frame. data <- read.csv("input.csv") retval <- subset(data, as.Date(start_date) > as.Date("2014-01-01"))
# Write filtered data into a new file. write.csv(retval, "output.csv", row.names = FALSE) newdata <- read.csv("output.csv") print(newdata)
# 当我们执行上面的代码,它产生以下结果:
id name salary start_date dept 1 3 Michelle 611.002014-11-15 IT 2 4 Ryan 729.002014-05-11 HR 3 NA Gary 843.252015-03-27 Finance 4 8 Guru 722.502014-06-17 Finance
通过使用 read.xlsx() 函数读取 input.xlsx,以下脚本读取第一个工作表的数据。结果作为数据框存储在 R 语言环境中。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
# Read the first worksheet in the file input.xlsx. data <- read.xlsx("input.xlsx", sheetIndex = 1) print(data)
# 当我们执行上面的代码,它产生以下结果:
id name salary start_date dept 1 1 Rick 623.302012-01-01 IT 2 2 Dan 515.202013-09-23 Operations 3 3 Michelle 611.002014-11-15 IT 4 4 Ryan 729.002014-05-11 HR 5 NA Gary 843.252015-03-27 Finance 6 6 Nina 578.002013-05-21 IT 7 7 Simon 632.802013-07-30 Operations 8 8 Guru 722.502014-06-17 Finance
我们可以通过设置 sheetIndex 参数读取指定的工作表。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
# Read the worksheet named by city in the file input.xlsx. data <- read.xlsx("input.xlsx", sheetIndex = "city") print(data)
# 当我们执行上面的代码,它产生以下结果:
name city 1 Rick Seattle 2 Dan Tampa 3 Michelle Chicago 4 Ryan Seattle 5 Gary Houston 6 Nina Boston 7 Simon Mumbai 8 Guru Dallas
写入 Excel 文件
R 语言中的 write.xlsx() 函数用于创建 Excel 文件,此文件在工作目录中创建。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
# Read the first worksheet in the file input.xlsx. data <- read.xlsx("input.xlsx", sheetIndex = 1) retval <- subset(data, as.Date(start_date) > as.Date("2014-01-01"))
# Write the data into a new file. write.xlsx(retval, file = "output.xlsx", row.names = FALSE, sheetName = "salary") newdata <- read.xlsx("output.xlsx", sheetIndex = "salary") print(newdata)
# 当我们执行上面的代码,它产生以下结果:
id name salary start_date dept 1 3 Michelle 611.002014-11-15 IT 2 4 Ryan 729.002014-05-11 HR 3 NA Gary 843.252015-03-27 Finance 4 8 Guru 722.502014-06-17 Finance
二进制文件必须由特定程序读取才能使用。例如,Microsoft Word 程序的二进制文件只能通过 Word 程序读取到人类可读的形式。这表示,除了人类可读的文本之外,还有更多的信息,例如字符和页码等的格式化,它们也与字母数字字符一起存储。最后一个二进制文件是一个连续的字节序列。我们在文本文件中看到的换行符是连接第一行到下一行的字符。
有时,由其他程序生成的数据需要由 R 作为二进制文件处理。另外,R 语言是创建可以与其他程序共享的二进制文件所必需的。
R 语言有两个函数 WriteBin() 和 readBin() 来创建和读取二进制文件。
语法
1 2
writeBin(object, con) readBin(con, what, n)
以下是所使用的参数的说明:
con 是读取或写入二进制文件的连接对象。
object 是要写入的二进制文件。
what 是像字符,整数等代表字节模式被读取。
n 是从二进制文件读取的字节数。
写入二进制文件
我们考虑 R 语言内置数据 mtcars。首先,我们从它创建一个 CSV 文件,并将其转换为二进制文件,并将其存储为操作系统文件。接下来我们读取这个创建的二进制文件。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
# Read the "mtcars" data frame as a csv file and store only the columns "cyl", "am" and "gear". write.table(mtcars, file = "mtcars.csv", row.names = FALSE, na = "", col.names = TRUE, sep = ",")
# Store 5 records from the csv file as a new data frame. new.mtcars <- read.table("mtcars.csv", sep = ",", header = TRUE, nrows = 5)
# Create a connection object to write the binary file using mode "wb". write.filename = file("binmtcars.dat", "wb")
# Write the column names of the data frame to the connection object. writeBin(c("cyl", "am", "gear"), write.filename)
# Write the records in each of the column to the file. writeBin(c(new.mtcars$cyl, new.mtcars$am, new.mtcars$gear), write.filename)
# Close the file for writing so that it can be read by other program. close(write.filename)
# Create a connection object to read the file in binary mode using "rb". read.filename <- file("binmtcars.dat", "rb")
# First read the column names. n = 3 as we have 3 columns. column.names <- readBin(read.filename, character(), n = 3)
# Next read the column values. n = 18 as we have 3 column names and 15 values. read.filename <- file("binmtcars.dat", "rb") bindata <- readBin(read.filename, integer(), n = 18)
# Print the data. print(bindata)
# Read the values from 4th byte to 8th byte which represents "cyl". cyldata = bindata[4:8] print(cyldata)
# Read the values form 9th byte to 13th byte which represents "am". amdata = bindata[9:13] print(amdata)
# Read the values form 9th byte to 13th byte which represents "gear". geardata = bindata[14:18] print(geardata)
# Combine all the read values to a dat frame. finaldata = cbind(cyldata, amdata, geardata) colnames(finaldata) = column.names print(finaldata)
# Load the packages required to read XML files. library("XML") library("methods")
# Give the input file name to the function. result <- xmlParse(file = "input.xml")
# Exract the root node form the xml file. rootnode <- xmlRoot(result)
# Get the first element of the first node. print(rootnode[[1]][[1]])
# Get the fifth element of the first node. print(rootnode[[1]][[5]])
# Get the second element of the third node. print(rootnode[[3]][[2]])
# 当我们执行上面的代码,它产生以下结果:
1 IT Michelle
XML 到数据框
为了在大文件中有效地处理数据,我们将 XML 文件中的数据作为数据框读取。然后处理数据框以进行数据分析。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
# Load the packages required to read XML files. library("XML") library("methods")
# Convert the input xml file to a data frame. xmldataframe <- xmlToDataFrame("input.xml") print(xmldataframe)
# 当我们执行上面的代码,它产生以下结果:
ID NAME SALARY STARTDATE DEPT 1 1 Rick 623.32012-01-01 IT 2 2 Dan 515.22013-09-23 Operations 3 3 Michelle 6112014-11-15 IT 4 4 Ryan 7292014-05-11 HR 5 5 Gary 843.252015-03-27 Finance 6 6 Nina 5782013-05-21 IT 7 7 Simon 632.82013-07-30 Operations 8 8 Guru 722.52014-06-17 Finance
# Load the package required to read JSON files. library("rjson")
# Give the input file name to the function. result <- fromJSON(file = "input.json")
# Convert JSON file to a data frame. json_data_frame <- as.data.frame(result)
print(json_data_frame)
# 当我们执行上面的代码,它产生以下结果:
ID Name Salary StartDate Dept 1 1 Rick 623.32012-01-01 IT 2 2 Dan 515.22013-09-23 Operations 3 3 Michelle 6112014-11-15 IT 4 4 Ryan 7292014-05-11 HR 5 5 Gary 843.252015-03-27 Finance 6 6 Nina 5782013-05-21 IT 7 7 Simon 632.82013-07-30 Operations 8 8 Guru 722.52014-06-17 Finance
Web 数据
许多网站提供数据供其用户使用。例如,世界卫生组织(WHO)以 CSV,txt 和 XML 文件的形式提供健康和医疗信息的报告。使用 R 语言程序,我们可以从这些网站以编程方式提取特定数据。R 语言中用于从网站中提取数据的一些包是 RCurl,XML 和 stringr,它们用于连接到 URL,识别文件所需的链接并将它们下载到本地环境。
安装 R 语言的包 处理 URL 和链接到文件需要以下的包。如果它们在 R 语言环境中不可用,您可以使用以下命令安装它们。
# Load the package required. library("RCurl") library("XML") library("stringr") library("plyr")
# Read the URL. url <- "https://www.geos.ed.ac.uk/~weather/jcmb_ws/"
# Gather the html links present in the webpage. links <- getHTMLLinks(url)
# Identify only the links which point to the JCMB 2015 files. filenames <- links[str_detect(links, "JCMB_2015")]
# Store the file names as a list. filenames_list <- as.list(filenames)
# Create a function to download the files by passing the URL and filename list. downloadcsv <- function (mainurl, filename) { filedetails <- str_c(mainurl, filename) download.file(filedetails, filename) }
# Now apply the l_ply function and save the files into the current R working directory. l_ply(filenames, downloadcsv, mainurl = "https://www.geos.ed.ac.uk/~weather/jcmb_ws/")
数据是关系数据库系统以规范化格式存储。因此,要进行统计计算,我们将需要非常先进和复杂的 SQL 查询。但 R 语言可以轻松地连接到许多关系数据库,如 MySQL,Oracle,SQL Server 等,并从它们获取记录作为数据框。一旦数据在 R 语言环境中可用,它就变成正常的 R 语言数据集,并且可以使用所有强大的包和函数来操作或分析。
在本教程中,我们将使用 MySQL 作为连接到 R 语言的参考数据库。
R 语言有一个名为 RMySQL 的内置包,它提供与 MySQL 数据库之间的本地连接。您可以使用以下命令在 R 语言环境中安装此软件包。
1
install.packages("RMySQL")
连接到 MySQL
一旦安装了包,我们在 R 中创建一个连接对象以连接到数据库。它使用用户名,密码,数据库名称和主机名作为输入。
1 2 3 4 5 6 7 8 9 10 11 12 13 14
# Create a connection Object to MySQL database. # We will connect to the sampel database named "testdb" that comes with MySQL installation. library("RMySQL") conn = dbConnect(MySQL(), user = 'root', password = 'root', dbname = 'testdb', host = 'localhost', port=3306)
# Set the encoding method to gbk dbSendQuery(conn, 'SET NAMES gbk')
# List the tables available in this database. dbListTables(conn)
我们可以使用函数 dbSendQuery() 查询 MySQL 中的数据库表。查询在 MySQL 中执行,并使用 R 语言 fetch() 函数返回结果集。最后,它被存储为 R 语言中的数据框。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
# Query the "tbl" tables to get all the rows. result = dbSendQuery(conn, "SELECT * FROM tbl")
# Store the result in a R data frame object. n = 6 is used to fetch first 6 rows. data.frame = fetch(result, n = 6) print(data.frame)
# 当我们执行上面的代码,它产生以下结果:
id title author submission_date 1 1 学习 PHP PHP 2021-01-12 2 2 学习 MySQL MySQL 2021-01-12 3 3 学习 C++ C++ 2021-01-01 4 4 学习 Python Python 2021-01-01 5 5 MySQL 教程 MySQL 2021-01-12 6 6 JAVA 教程 JAVA 2021-01-12
带过滤条件的查询
我们可以传递任何有效的 SELECT 查询来获取结果。
1 2 3 4 5 6 7 8 9 10 11 12
# Query the "tbl" tables to get all the rows with author equal to 'MySQL'. result = dbSendQuery(conn, "SELECT * FROM tbl WHERE author = 'MySQL'")
# Fetch all the records and store it as a data frame. data.frame = fetch(result) print(data.frame)
# 当我们执行上面的代码,它产生以下结果:
id title author submission_date 1 2 学习 MySQL MySQL 2021-01-12 2 5 MySQL 教程 MySQL 2021-01-12
# Insert data to the "tbl" tables dbSendQuery(conn, "INSERT INTO tbl (title, author, submission_date) VALUES('学习 R', 'R', '2021-01-22')" )
# Query the "tbl" tables to get all the rows. result = dbSendQuery(conn, "SELECT * FROM tbl")
# Fetch all the records and store it as a data frame. data.frame = fetch(result) print(data.frame)
# 在执行上面的代码后,我们可以看到插入到 MySQL 环境中的表中的行。
id title author submission_date 1 1 学习 PHP PHP 2000-01-12 2 2 学习 MySQL MySQL 2021-01-12 3 3 学习 C++ C++ 2021-01-01 4 4 学习 Python Python 2021-01-01 5 5 MySQL 教程 MySQL 2021-01-12 6 6 JAVA 教程 JAVA 2021-01-12 7 7 学习 R R 2021-01-22
创建表
我们可以在 MySQL 中使用函数 dbWriteTable() 创建表。如果表已经存在,它将覆盖该表,并将数据框用作输入。
# Create the connection object to the database where we want to create the table. conn = dbConnect(MySQL(), user = 'root', password = 'root', dbname = 'testdb', host = 'localhost', port = '3306')
# Use the R data frame "mtcars" to create the table in MySQL. # All the rows of mtcars are taken inot MySQL. dbWriteTable(conn, "mtcars", mtcars[, ], overwrite = TRUE)
# List the tables available in this database. dbListTables(conn)
# Query the "mtcars" tables to get the rows result = dbSendQuery(conn, "SELECT * FROM mtcars")
# Fetch all the records and store it as a data frame. data.frame = fetch(result) print(data.frame)