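R Data Interfaces

Introduction

The data that R processes usually comes from outside the R environment, so R needs data interfaces for reading data in various formats.

In R, we can read data from files stored outside the R environment. We can also write data to files that the operating system stores and makes accessible. R can read and write a variety of file formats, such as CSV, Excel, XML, and JSON.

CSV Files

This section covers reading data from a CSV file and then writing data to a CSV file. The file should be in the current working directory so that R can find it by name. We can also set our own working directory and read files from there.

Getting and Setting the Working Directory

You can check which directory R is currently working in with the getwd() function, and set a new working directory with the setwd() function.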

# Get and print current working directory.
print(getwd())

# Set current working directory.
setwd("E:/data")

# Get and print current working directory.
print(getwd())

# When we execute the above code, it produces the following result:

[1] "C:/Users/user/Documents"
[1] "E:/data"

This result depends on your operating system and the directory you are currently working in.
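When a script changes the working directory, it is usually polite to restore the previous one afterwards. The sketch below illustrates that pattern under one assumption: tempdir() stands in for a real folder such as "E:/data", and the names old_wd and work_dir are only illustrative.

```r
# Remember where we started so that we can come back later.
old_wd <- getwd()

# Assumption: tempdir() stands in for a real data folder such as "E:/data".
work_dir <- tempdir()

if (dir.exists(work_dir)) {
  setwd(work_dir)
  print(getwd())
  # ... read or write files relative to work_dir here ...
  setwd(old_wd)  # restore the original working directory
}

# Confirm that we are back where we started.
print(identical(getwd(), old_wd))
```

Checking dir.exists() first avoids the error that setwd() raises when the target directory is missing.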

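Creating the CSV File

A CSV file is a text file in which the values in each row are separated by commas.

Create the file by copying the following data into a text editor such as Notepad. When saving in Notepad, choose the All Files (*.*) option and name the file input.csv so that it gets the .csv extension.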

id,name,salary,start_date,dept
1,Rick,623.3,2012-01-01,IT
2,Dan,515.2,2013-09-23,Operations
3,Michelle,611,2014-11-15,IT
4,Ryan,729,2014-05-11,HR
 ,Gary,843.25,2015-03-27,Finance
6,Nina,578,2013-05-21,IT
7,Simon,632.8,2013-07-30,Operations
8,Guru,722.5,2014-06-17,Finance
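If you would rather create the file from R than from a text editor, something like the following sketch should work. It writes to a temporary directory instead of the working directory, which is an assumption made here to keep the example side-effect free; csv_lines and csv_path are illustrative names.

```r
# The same records as above, one string per line of the file.
csv_lines <- c(
  "id,name,salary,start_date,dept",
  "1,Rick,623.3,2012-01-01,IT",
  "2,Dan,515.2,2013-09-23,Operations",
  "3,Michelle,611,2014-11-15,IT",
  "4,Ryan,729,2014-05-11,HR",
  ",Gary,843.25,2015-03-27,Finance",
  "6,Nina,578,2013-05-21,IT",
  "7,Simon,632.8,2013-07-30,Operations",
  "8,Guru,722.5,2014-06-17,Finance"
)

# Assumption: write to a temporary directory rather than the working directory.
csv_path <- file.path(tempdir(), "input.csv")
writeLines(csv_lines, csv_path)

# Read the file back to confirm that it parses as expected.
data <- read.csv(csv_path)
print(dim(data))
```

To place the file in the working directory instead, replace csv_path with "input.csv".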

Reading the CSV File

The following is a simple example of the read.csv() function, which reads a CSV file available in the current working directory.

data <- read.csv("input.csv")
print(data)

# When we execute the above code, it produces the following result:

  id     name salary start_date       dept
1  1     Rick 623.30 2012-01-01         IT
2  2      Dan 515.20 2013-09-23 Operations
3  3 Michelle 611.00 2014-11-15         IT
4  4     Ryan 729.00 2014-05-11         HR
5 NA     Gary 843.25 2015-03-27    Finance
6  6     Nina 578.00 2013-05-21         IT
7  7    Simon 632.80 2013-07-30 Operations
8  8     Guru 722.50 2014-06-17    Finance

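Analyzing the CSV File

By default, the read.csv() function returns its output as a data frame. This is easy to check as follows. We can also check the number of columns and rows.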

data <- read.csv("input.csv")

print(is.data.frame(data))
print(ncol(data))
print(nrow(data))

# When we execute the above code, it produces the following result:

[1] TRUE
[1] 5
[1] 8

Once we have read the data into a data frame, we can apply all the functions applicable to data frames.
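As a rough illustration of what "functions applicable to data frames" can mean in practice, the sketch below inspects column types, counts missing values, and computes a per-department mean. It embeds the first five records through the text argument of read.csv() so that it runs without the file; the names col_types, na_counts, and dept_means are illustrative.

```r
# First five records of input.csv, embedded so the example is self-contained.
data <- read.csv(text = "id,name,salary,start_date,dept
1,Rick,623.3,2012-01-01,IT
2,Dan,515.2,2013-09-23,Operations
3,Michelle,611,2014-11-15,IT
4,Ryan,729,2014-05-11,HR
,Gary,843.25,2015-03-27,Finance")

# Column types as R inferred them.
col_types <- sapply(data, class)
print(col_types)

# Number of missing values per column (the blank id is read as NA).
na_counts <- colSums(is.na(data))
print(na_counts)

# Mean salary per department.
dept_means <- aggregate(salary ~ dept, data = data, FUN = mean)
print(dept_means)
```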

Get the Maximum Salary

# Create a data frame.
data <- read.csv("input.csv")

# Get the max salary from data frame.
sal <- max(data$salary)
print(sal)

# When we execute the above code, it produces the following result:

[1] 843.25

Get the Details of the Person with the Maximum Salary

We can fetch the rows that meet a specific filter condition, similar to a SQL WHERE clause.

# Create a data frame.
data <- read.csv("input.csv")

# Get the person detail having max salary.
retval <- subset(data, salary == max(salary))
print(retval)

# When we execute the above code, it produces the following result:

  id name salary start_date    dept
5 NA Gary 843.25 2015-03-27 Finance

Get All the People Working in the IT Department

# Create a data frame.
data <- read.csv("input.csv")

retval <- subset( data, dept == "IT")
print(retval)

# When we execute the above code, it produces the following result:

  id     name salary start_date dept
1  1     Rick  623.3 2012-01-01   IT
3  3 Michelle  611.0 2014-11-15   IT
6  6     Nina  578.0 2013-05-21   IT

Get the People in the IT Department Whose Salary Is Greater than 600

# Create a data frame.
data <- read.csv("input.csv")

info <- subset(data, salary > 600 & dept == "IT")
print(info)

# When we execute the above code, it produces the following result:

  id     name salary start_date dept
1  1     Rick  623.3 2012-01-01   IT
3  3 Michelle  611.0 2014-11-15   IT

Get the People Who Joined in or after 2014

# Create a data frame.
data <- read.csv("input.csv")

retval <- subset(data, as.Date(start_date) >= as.Date("2014-01-01"))
print(retval)

# When we execute the above code, it produces the following result:

  id     name salary start_date    dept
3  3 Michelle 611.00 2014-11-15      IT
4  4     Ryan 729.00 2014-05-11      HR
5 NA     Gary 843.25 2015-03-27 Finance
8  8     Guru 722.50 2014-06-17 Finance
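Instead of calling as.Date() inside every filter, the date column can be converted once while reading. The following sketch assumes the colClasses argument of read.csv() with a named entity for start_date; it embeds five records via the text argument so that it runs on its own, and recent is an illustrative name.

```r
# Parse start_date as a Date while reading; other columns are inferred as usual.
data <- read.csv(
  text = "id,name,salary,start_date,dept
1,Rick,623.3,2012-01-01,IT
2,Dan,515.2,2013-09-23,Operations
3,Michelle,611,2014-11-15,IT
4,Ryan,729,2014-05-11,HR
,Gary,843.25,2015-03-27,Finance",
  colClasses = c(start_date = "Date")
)

print(class(data$start_date))

# Keep everyone who joined on or after 1 January 2014.
recent <- data[data$start_date >= as.Date("2014-01-01"), ]
print(recent$name)
```

With the column stored as a Date, comparisons, sorting, and date arithmetic work directly on it.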

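Writing to a CSV File

R can create a CSV file from an existing data frame. The write.csv() function is used to create the CSV file. The file is created in the working directory.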

# Create a data frame.
data <- read.csv("input.csv")
retval <- subset(data, as.Date(start_date) > as.Date("2014-01-01"))

# Write filtered data into a new file.
write.csv(retval, "output.csv")
newdata <- read.csv("output.csv")
print(newdata)

# When we execute the above code, it produces the following result:

  X id     name salary start_date    dept
1 3  3 Michelle 611.00 2014-11-15      IT
2 4  4     Ryan 729.00 2014-05-11      HR
3 5 NA     Gary 843.25 2015-03-27 Finance
4 8  8     Guru 722.50 2014-06-17 Finance

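Here the column X holds the row names of the data frame, which write.csv() writes by default. It can be dropped by passing the additional parameter row.names = FALSE when writing the file.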

# Create a data frame.
data <- read.csv("input.csv")
retval <- subset(data, as.Date(start_date) > as.Date("2014-01-01"))

# Write filtered data into a new file.
write.csv(retval, "output.csv", row.names = FALSE)
newdata <- read.csv("output.csv")
print(newdata)

# When we execute the above code, it produces the following result:

  id     name salary start_date    dept
1  3 Michelle 611.00 2014-11-15      IT
2  4     Ryan 729.00 2014-05-11      HR
3 NA     Gary 843.25 2015-03-27 Finance
4  8     Guru 722.50 2014-06-17 Finance
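One way to convince yourself that row.names = FALSE gives a clean round trip is to compare the data before and after. This is only a sketch: it builds a small data frame by hand and writes to a temporary file, and the names original, out_path, and restored are illustrative.

```r
# A small data frame that resembles the filtered result above.
original <- data.frame(
  id = c(3L, 4L, NA, 8L),
  name = c("Michelle", "Ryan", "Gary", "Guru"),
  salary = c(611, 729, 843.25, 722.5),
  stringsAsFactors = FALSE
)

# Assumption: use a temporary file so that nothing in the working directory changes.
out_path <- tempfile(fileext = ".csv")
write.csv(original, out_path, row.names = FALSE)

# Read the file back and compare it with what we wrote.
restored <- read.csv(out_path, stringsAsFactors = FALSE)
print(names(restored))
print(isTRUE(all.equal(original, restored)))

unlink(out_path)
```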

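Excel Files

Microsoft Excel is the most widely used spreadsheet program, and it stores data in .xls or .xlsx format. R can read directly from these files using Excel-specific packages such as XLConnect, xlsx, and gdata.

Below we will use the xlsx package. R can also write to Excel files using this package.

Installing the xlsx Package

You can use the following command in the R console to install the xlsx package. It may ask you to install additional packages on which this package depends. Use the same command with the required package name to install them.

install.packages("xlsx")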

Verifying and Loading the xlsx Package

Use the following commands to verify and load the xlsx package.

# Verify the package is installed.
"xlsx" %in% rownames(installed.packages())

# Load the library into R workspace.
library("xlsx")

# When we execute the above code, it produces the following result:

[1] TRUE
Loading required package: rJava
Loading required package: methods
Loading required package: xlsxjars

Creating the xlsx File

Open Microsoft Excel, then copy and paste the following data into the worksheet named sheet1.

id	name	salary	start_date	dept
1	Rick	623.3	2012-01-01	IT
2	Dan	515.2	2013-09-23	Operations
3	Michelle	611	2014-11-15	IT
4	Ryan	729	2014-05-11	HR
	Gary	843.25	2015-03-27	Finance
6	Nina	578	2013-05-21	IT
7	Simon	632.8	2013-07-30	Operations
8	Guru	722.5	2014-06-17	Finance

Also copy and paste the following data into another worksheet, and rename this worksheet to city.

name	city
Rick	Seattle
Dan	Tampa
Michelle	Chicago
Ryan	Seattle
Gary	Houston
Nina	Boston
Simon	Mumbai
Guru	Dallas

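Save the Excel file as input.xlsx. It should be saved in the current working directory of the R workspace.

Reading the Excel File

The following script reads input.xlsx using the read.xlsx() function and loads the data of the first worksheet. The result is stored as a data frame in the R environment.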

# Read the first worksheet in the file input.xlsx.
data <- read.xlsx("input.xlsx", sheetIndex = 1)
print(data)

# When we execute the above code, it produces the following result:

  id     name salary start_date       dept
1  1     Rick 623.30 2012-01-01         IT
2  2      Dan 515.20 2013-09-23 Operations
3  3 Michelle 611.00 2014-11-15         IT
4  4     Ryan 729.00 2014-05-11         HR
5 NA     Gary 843.25 2015-03-27    Finance
6  6     Nina 578.00 2013-05-21         IT
7  7    Simon 632.80 2013-07-30 Operations
8  8     Guru 722.50 2014-06-17    Finance

We can also read a specific worksheet by name by setting the sheetName parameter.

# Read the worksheet named "city" in the file input.xlsx.
data <- read.xlsx("input.xlsx", sheetName = "city")
print(data)

# When we execute the above code, it produces the following result:

      name    city
1     Rick Seattle
2      Dan   Tampa
3 Michelle Chicago
4     Ryan Seattle
5     Gary Houston
6     Nina  Boston
7    Simon  Mumbai
8     Guru  Dallas

Writing to an Excel File

The write.xlsx() function is used to create an Excel file. The file is created in the working directory.

# Read the first worksheet in the file input.xlsx.
data <- read.xlsx("input.xlsx", sheetIndex = 1)
retval <- subset(data, as.Date(start_date) > as.Date("2014-01-01"))

# Write the data into a new file.
write.xlsx(retval, file = "output.xlsx", row.names = FALSE, sheetName = "salary")
newdata <- read.xlsx("output.xlsx", sheetName = "salary")
print(newdata)

# When we execute the above code, it produces the following result:

  id     name salary start_date    dept
1  3 Michelle 611.00 2014-11-15      IT
2  4     Ryan 729.00 2014-05-11      HR
3 NA     Gary 843.25 2015-03-27 Finance
4  8     Guru 722.50 2014-06-17 Finance

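Binary Files

A binary file is a file that contains information stored only in the form of bits and bytes (0s and 1s). It is not human-readable, because its bytes translate to characters and symbols that include many non-printable characters. Attempting to read a binary file with a text editor will show characters such as Ø and ð.

A binary file has to be read by a specific program to be usable. For example, the binary file of a Microsoft Word document can be rendered in human-readable form only by Word. This indicates that, besides the human-readable text, the file stores much more information alongside the alphanumeric characters, such as character formatting and page numbers. Ultimately, a binary file is a continuous sequence of bytes. Even the line break we see in a text file is just a character joining one line to the next.

Sometimes data generated by other programs needs to be processed by R as a binary file. R may also need to create binary files that can be shared with other programs.

R has two functions, writeBin() and readBin(), to create and read binary files.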

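Syntax

writeBin(object, con)
readBin(con, what, n)

The parameters used are described below:

con is the connection object used to read or write the binary file.
object is the R object (an atomic vector) to be written.
what is the mode, such as character or integer, of the vector to be read.
n is the maximum number of elements (not bytes) to read from the binary file.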
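The parameters above can be tried out with a small round trip. The sketch below writes five integers and reads them back; it additionally passes size and endian, two further arguments of writeBin() and readBin(), to make the byte layout explicit. The file lives in a temporary location, and the names bin_path, values, n_bytes, and restored are illustrative.

```r
# Assumption: a temporary file stands in for a real binary file.
bin_path <- tempfile(fileext = ".dat")

# Write five integers; size = 4 and endian = "little" fix the byte layout.
values <- c(6L, 6L, 4L, 6L, 8L)
con <- file(bin_path, "wb")
writeBin(values, con, size = 4, endian = "little")
close(con)

# Five integers of four bytes each were written.
n_bytes <- file.size(bin_path)
print(n_bytes)

# Read them back; n counts elements of type "what", not bytes.
con <- file(bin_path, "rb")
restored <- readBin(con, what = integer(), n = 5, size = 4, endian = "little")
close(con)
print(restored)

unlink(bin_path)
```

Stating size and endian explicitly matters mainly when the file is exchanged with other programs or other machines.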

Writing a Binary File

We consider the built-in R data set mtcars. First we create a CSV file from it, then we convert part of it to a binary file and store it as an operating system file. Next we read back the binary file we created.

# Write the "mtcars" data frame to a CSV file.
write.table(mtcars, file = "mtcars.csv", row.names = FALSE, na = "", col.names = TRUE, sep = ",")

# Store 5 records from the csv file as a new data frame.
new.mtcars <- read.table("mtcars.csv", sep = ",", header = TRUE, nrows = 5)

# Create a connection object to write the binary file using mode "wb".
write.filename = file("binmtcars.dat", "wb")

# Write the column names of the data frame to the connection object.
writeBin(c("cyl", "am", "gear"), write.filename)

# Write the records in each of the column to the file.
writeBin(c(new.mtcars$cyl, new.mtcars$am, new.mtcars$gear), write.filename)

# Close the file for writing so that it can be read by other program.
close(write.filename)

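Reading the Binary File

The binary file created above stores all the data as continuous bytes. So we read it by choosing appropriate counts for the column names and for the column values.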

# Create a connection object to read the file in binary mode using "rb".
read.filename <- file("binmtcars.dat", "rb")

# First read the column names. n = 3 as we have 3 columns.
column.names <- readBin(read.filename, character(),  n = 3)

# Next read the column values. n = 18 as we have 3 column names and 15 values.
# Reopen the file so that reading starts from the beginning again; the 12 bytes
# holding the column names are then read as the first 3 integers.
close(read.filename)
read.filename <- file("binmtcars.dat", "rb")
bindata <- readBin(read.filename, integer(), n = 18)
close(read.filename)

# Print the data.
print(bindata)

# Elements 4 to 8 hold the values of "cyl".
cyldata = bindata[4:8]
print(cyldata)

# Elements 9 to 13 hold the values of "am".
amdata = bindata[9:13]
print(amdata)

# Elements 14 to 18 hold the values of "gear".
geardata = bindata[14:18]
print(geardata)

# Combine all the values into a matrix with one column per variable.
finaldata = cbind(cyldata, amdata, geardata)
colnames(finaldata) = column.names
print(finaldata)

# When we execute the above code, it produces the following result:

[1]    7108963 1728081249    7496037          6          6          4
[7]          6          8          1          1          1          0
[13]         0          4          4          4          3          3

[1] 6 6 4 6 8

[1] 1 1 1 0 0

[1] 4 4 4 3 3

     cyl am gear
[1,]   6  1    4
[2,]   6  1    4
[3,]   4  1    4
[4,]   6  0    3
[5,]   8  0    3

As we can see, we got the original data back by reading the binary file in R.
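An alternative that avoids reading the name bytes as integers is to read from a single connection in sequence: first the three strings, then the fifteen integers that follow them. The sketch below assumes the same layout as the file above but writes its own temporary copy with hand-entered values; bin_path, column_names, column_values, and final are illustrative names.

```r
# Recreate a file with the same layout: 3 column names followed by 15 integers.
bin_path <- tempfile(fileext = ".dat")
cyl  <- c(6L, 6L, 4L, 6L, 8L)
am   <- c(1L, 1L, 1L, 0L, 0L)
gear <- c(4L, 4L, 4L, 3L, 3L)

con <- file(bin_path, "wb")
writeBin(c("cyl", "am", "gear"), con)  # strings are written with a trailing zero byte
writeBin(c(cyl, am, gear), con)
close(con)

# Read sequentially from one connection: names first, then the values.
con <- file(bin_path, "rb")
column_names  <- readBin(con, character(), n = 3)
column_values <- readBin(con, integer(), n = 15)
close(con)

# The values were written column by column, so fill the matrix column-wise.
final <- as.data.frame(
  matrix(column_values, ncol = 3, dimnames = list(NULL, column_names))
)
print(final)

unlink(bin_path)
```

Because the connection keeps its position between the two readBin() calls, no index arithmetic on the result is needed.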

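XML Files

XML is a file format for sharing both the format and the data of files on the World Wide Web, intranets, and elsewhere using standard ASCII text. XML stands for Extensible Markup Language. Like HTML, it contains markup tags. But unlike HTML, where the tags describe the structure of the page, in XML the tags describe the meaning of the data contained in the file.

You can read an XML file in R using the XML package. This package can be installed using the following command.

install.packages("XML")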

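Creating the XML File

Create the file by copying the following data into a text editor such as Notepad. When saving in Notepad, choose the All Files (*.*) option and name the file input.xml so that it gets the .xml extension.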

<RECORDS>
    <EMPLOYEE>
        <ID>1</ID>
        <NAME>Rick</NAME>
        <SALARY>623.3</SALARY>
        <STARTDATE>2012-01-01</STARTDATE>
        <DEPT>IT</DEPT>
    </EMPLOYEE>
    <EMPLOYEE>
        <ID>2</ID>
        <NAME>Dan</NAME>
        <SALARY>515.2</SALARY>
        <STARTDATE>2013-09-23</STARTDATE>
        <DEPT>Operations</DEPT>
    </EMPLOYEE>
    <EMPLOYEE>
        <ID>3</ID>
        <NAME>Michelle</NAME>
        <SALARY>611</SALARY>
        <STARTDATE>2014-11-15</STARTDATE>
        <DEPT>IT</DEPT>
    </EMPLOYEE>
    <EMPLOYEE>
        <ID>4</ID>
        <NAME>Ryan</NAME>
        <SALARY>729</SALARY>
        <STARTDATE>2014-05-11</STARTDATE>
        <DEPT>HR</DEPT>
    </EMPLOYEE>
    <EMPLOYEE>
        <ID>5</ID>
        <NAME>Gary</NAME>
        <SALARY>843.25</SALARY>
        <STARTDATE>2015-03-27</STARTDATE>
        <DEPT>Finance</DEPT>
    </EMPLOYEE>
    <EMPLOYEE>
        <ID>6</ID>
        <NAME>Nina</NAME>
        <SALARY>578</SALARY>
        <STARTDATE>2013-05-21</STARTDATE>
        <DEPT>IT</DEPT>
    </EMPLOYEE>
    <EMPLOYEE>
        <ID>7</ID>
        <NAME>Simon</NAME>
        <SALARY>632.8</SALARY>
        <STARTDATE>2013-07-30</STARTDATE>
        <DEPT>Operations</DEPT>
    </EMPLOYEE>
    <EMPLOYEE>
        <ID>8</ID>
        <NAME>Guru</NAME>
        <SALARY>722.5</SALARY>
        <STARTDATE>2014-06-17</STARTDATE>
        <DEPT>Finance</DEPT>
    </EMPLOYEE>
</RECORDS>

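Reading the XML File

The XML file is read by R using the function xmlParse(). The result is stored in R as a parsed XML document object, whose nodes can be indexed much like the elements of a list.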

# Load the package required to read XML files.
library("XML")

# Also load the other required package.
library("methods")

# Give the input file name to the function.
result <- xmlParse(file = "input.xml")

# Print the result.
print(result)

# When we execute the above code, it produces the following result:

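<?xml version="1.0"?>
<RECORDS>
  <EMPLOYEE>
    <ID>1</ID>
    <NAME>Rick</NAME>
    <SALARY>623.3</SALARY>
    <STARTDATE>2012-01-01</STARTDATE>
    <DEPT>IT</DEPT>
  </EMPLOYEE>
  <EMPLOYEE>
    <ID>2</ID>
    <NAME>Dan</NAME>
    <SALARY>515.2</SALARY>
    <STARTDATE>2013-09-23</STARTDATE>
    <DEPT>Operations</DEPT>
  </EMPLOYEE>
  <EMPLOYEE>
    <ID>3</ID>
    <NAME>Michelle</NAME>
    <SALARY>611</SALARY>
    <STARTDATE>2014-11-15</STARTDATE>
    <DEPT>IT</DEPT>
  </EMPLOYEE>
  <EMPLOYEE>
    <ID>4</ID>
    <NAME>Ryan</NAME>
    <SALARY>729</SALARY>
    <STARTDATE>2014-05-11</STARTDATE>
    <DEPT>HR</DEPT>
  </EMPLOYEE>
  <EMPLOYEE>
    <ID>5</ID>
    <NAME>Gary</NAME>
    <SALARY>843.25</SALARY>
    <STARTDATE>2015-03-27</STARTDATE>
    <DEPT>Finance</DEPT>
  </EMPLOYEE>
  <EMPLOYEE>
    <ID>6</ID>
    <NAME>Nina</NAME>
    <SALARY>578</SALARY>
    <STARTDATE>2013-05-21</STARTDATE>
    <DEPT>IT</DEPT>
  </EMPLOYEE>
  <EMPLOYEE>
    <ID>7</ID>
    <NAME>Simon</NAME>
    <SALARY>632.8</SALARY>
    <STARTDATE>2013-07-30</STARTDATE>
    <DEPT>Operations</DEPT>
  </EMPLOYEE>
  <EMPLOYEE>
    <ID>8</ID>
    <NAME>Guru</NAME>
    <SALARY>722.5</SALARY>
    <STARTDATE>2014-06-17</STARTDATE>
    <DEPT>Finance</DEPT>
  </EMPLOYEE>
</RECORDS>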

Get the Number of Nodes Present in the XML File

# Load the packages required to read XML files.
library("XML")
library("methods")

# Give the input file name to the function.
result <- xmlParse(file = "input.xml")

# Extract the root node from the XML file.
rootnode <- xmlRoot(result)

# Find number of nodes in the root.
rootsize <- xmlSize(rootnode)

# Print the result.
print(rootsize)

# When we execute the above code, it produces the following result:

[1] 8

Details of the First Node

Let's look at the first record of the parsed file. It gives us an idea of the various elements present in the top-level node.

# Load the packages required to read XML files.
library("XML")
library("methods")

# Give the input file name to the function.
result <- xmlParse(file = "input.xml")

# Extract the root node from the XML file.
rootnode <- xmlRoot(result)

# Print the result.
print(rootnode[1])

# When we execute the above code, it produces the following result:

$EMPLOYEE
<EMPLOYEE>
  <ID>1</ID>
  <NAME>Rick</NAME>
  <SALARY>623.3</SALARY>
  <STARTDATE>2012-01-01</STARTDATE>
  <DEPT>IT</DEPT>
</EMPLOYEE>

attr(,"class")
[1] "XMLInternalNodeList" "XMLNodeList"

Get the Different Elements of a Node

# Load the packages required to read XML files.
library("XML")
library("methods")

# Give the input file name to the function.
result <- xmlParse(file = "input.xml")

# Extract the root node from the XML file.
rootnode <- xmlRoot(result)

# Get the first element of the first node.
print(rootnode[[1]][[1]])

# Get the fifth element of the first node.
print(rootnode[[1]][[5]])

# Get the second element of the third node.
print(rootnode[[3]][[2]])

# When we execute the above code, it produces the following result:

<ID>1</ID>
<DEPT>IT</DEPT>
<NAME>Michelle</NAME>

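XML to Data Frame

To handle the data in large files effectively, we read the data in the XML file as a data frame. We then process the data frame for data analysis.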

# Load the packages required to read XML files.
library("XML")
library("methods")

# Convert the input xml file to a data frame.
xmldataframe <- xmlToDataFrame("input.xml")
print(xmldataframe)

# When we execute the above code, it produces the following result:

  ID     NAME SALARY  STARTDATE       DEPT
1  1     Rick  623.3 2012-01-01         IT
2  2      Dan  515.2 2013-09-23 Operations
3  3 Michelle    611 2014-11-15         IT
4  4     Ryan    729 2014-05-11         HR
5  5     Gary 843.25 2015-03-27    Finance
6  6     Nina    578 2013-05-21         IT
7  7    Simon  632.8 2013-07-30 Operations
8  8     Guru  722.5 2014-06-17    Finance

As the data is now available as a data frame, we can use data frame functions to read and manipulate it.
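One thing to keep in mind is that values coming out of XML are text, so numeric work usually needs a conversion step first. The sketch below is one possible way to do that. It assumes the XML package is installed (the code is skipped otherwise), parses two records from an in-memory string rather than input.xml, and uses the illustrative names xml_text, doc, and employees.

```r
# Two records in the same shape as input.xml, kept inline for a self-contained example.
xml_text <- "<RECORDS>
  <EMPLOYEE><ID>1</ID><NAME>Rick</NAME><SALARY>623.3</SALARY><STARTDATE>2012-01-01</STARTDATE><DEPT>IT</DEPT></EMPLOYEE>
  <EMPLOYEE><ID>5</ID><NAME>Gary</NAME><SALARY>843.25</SALARY><STARTDATE>2015-03-27</STARTDATE><DEPT>Finance</DEPT></EMPLOYEE>
</RECORDS>"

# Assumption: run only when the XML package is available.
if (requireNamespace("XML", quietly = TRUE)) {
  doc <- XML::xmlParse(xml_text, asText = TRUE)
  employees <- XML::xmlToDataFrame(doc, stringsAsFactors = FALSE)

  # Every column arrives as text; convert the ones we want to compute with.
  employees$ID <- as.integer(employees$ID)
  employees$SALARY <- as.numeric(employees$SALARY)
  employees$STARTDATE <- as.Date(employees$STARTDATE)

  print(sapply(employees, function(col) class(col)[1]))
  print(max(employees$SALARY))
}
```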

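JSON Files

A JSON file stores data as text in a human-readable format. JSON stands for JavaScript Object Notation. R can read JSON files using the rjson package.

In the R console, you can issue the following command to install the rjson package.

install.packages("rjson")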

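Creating the JSON File

Create the file by copying the following data into a text editor such as Notepad. When saving in Notepad, choose the All Files (*.*) option and name the file input.json so that it gets the .json extension.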

{
	"ID": ["1", "2", "3", "4", "5", "6", "7", "8"],
	"Name": ["Rick", "Dan", "Michelle", "Ryan", "Gary", "Nina", "Simon", "Guru"],
	"Salary": ["623.3", "515.2", "611", "729", "843.25", "578", "632.8", "722.5"],

	"StartDate": ["2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11", "2015-03-27", "2013-05-21",
		"2013-07-30", "2014-06-17"
	],
	"Dept": ["IT", "Operations", "IT", "HR", "Finance", "IT", "Operations", "Finance"]
}

Reading the JSON File

The JSON file is read by R using the fromJSON() function. It is stored as a list in R.

# Load the package required to read JSON files.
library("rjson")

# Give the input file name to the function.
result <- fromJSON(file = "input.json")

# Print the result.
print(result)

# When we execute the above code, it produces the following result:

$ID
[1] "1" "2" "3" "4" "5" "6" "7" "8"

$Name
[1] "Rick"     "Dan"      "Michelle" "Ryan"     "Gary"     "Nina"     "Simon"    "Guru"    

$Salary
[1] "623.3"  "515.2"  "611"    "729"    "843.25" "578"    "632.8"  "722.5" 

$StartDate
[1] "2012-01-01" "2013-09-23" "2014-11-15" "2014-05-11" "2015-03-27" "2013-05-21" "2013-07-30" "2014-06-17"

$Dept
[1] "IT"         "Operations" "IT"         "HR"         "Finance"    "IT"         "Operations" "Finance"

Converting JSON to a Data Frame

We can convert the data extracted above to an R data frame for further analysis using the as.data.frame() function.

# Load the package required to read JSON files.
library("rjson")

# Give the input file name to the function.
result <- fromJSON(file = "input.json")

# Convert JSON file to a data frame.
json_data_frame <- as.data.frame(result)

print(json_data_frame)

# When we execute the above code, it produces the following result:

  ID     Name Salary  StartDate       Dept
1  1     Rick  623.3 2012-01-01         IT
2  2      Dan  515.2 2013-09-23 Operations
3  3 Michelle    611 2014-11-15         IT
4  4     Ryan    729 2014-05-11         HR
5  5     Gary 843.25 2015-03-27    Finance
6  6     Nina    578 2013-05-21         IT
7  7    Simon  632.8 2013-07-30 Operations
8  8     Guru  722.5 2014-06-17    Finance
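Because every value in input.json is a quoted string, the columns of the resulting data frame are text. A possible follow-up step, sketched here, converts them to more useful types. To stay runnable without rjson, the list below is typed in by hand and only mirrors the shape that fromJSON() returned above (first three records); that substitution is an assumption of the example.

```r
# Assumption: a hand-written list with the same shape as the fromJSON() result.
result <- list(
  ID        = c("1", "2", "3"),
  Name      = c("Rick", "Dan", "Michelle"),
  Salary    = c("623.3", "515.2", "611"),
  StartDate = c("2012-01-01", "2013-09-23", "2014-11-15"),
  Dept      = c("IT", "Operations", "IT")
)

json_data_frame <- as.data.frame(result, stringsAsFactors = FALSE)

# Convert the text columns to the types we want to compute with.
json_data_frame$ID <- as.integer(json_data_frame$ID)
json_data_frame$Salary <- as.numeric(json_data_frame$Salary)
json_data_frame$StartDate <- as.Date(json_data_frame$StartDate)

print(sapply(json_data_frame, class))
print(mean(json_data_frame$Salary))
```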

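Web Data

Many websites provide data for their users to consume. For example, the World Health Organization (WHO) provides reports on health and medical information in the form of CSV, txt, and XML files. Using R programs, we can programmatically extract specific data from such websites. Some packages in R used to extract data from the web are RCurl, XML, and stringr. They are used to connect to URLs, identify the required links to the files, and download them to the local environment.

Installing the R Packages

The following packages are required for processing URLs and links to files. If they are not available in your R environment, you can install them using the following commands.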

install.packages("RCurl")
install.packages("XML")
install.packages("stringr")
install.packages("plyr")

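Reading Web Data

We will visit the weather data URL and download the CSV files for the year 2015 using R.

We will use the function getHTMLLinks() to gather the URLs of the files. Then we will use the function download.file() to save the files to the local system. As we will be applying the same code again and again to multiple files, we will create a function that is called multiple times. The file names are passed to this function as parameters in the form of an R list object.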

# Load the package required.
library("RCurl")
library("XML")
library("stringr")
library("plyr")

# Read the URL.
url <- "https://www.geos.ed.ac.uk/~weather/jcmb_ws/"

# Gather the html links present in the webpage.
links <- getHTMLLinks(url)

# Identify only the links which point to the JCMB 2015 files. 
filenames <- links[str_detect(links, "JCMB_2015")]

# Store the file names as a list.
filenames_list <- as.list(filenames)

# Create a function to download the files by passing the URL and filename list.
downloadcsv <- function (mainurl, filename) {
   filedetails <- str_c(mainurl, filename)
   download.file(filedetails, filename)
}

# Now apply the l_ply function and save the files into the current R working directory.
l_ply(filenames_list, downloadcsv, mainurl = url)

Verifying the File Download

After running the above code, you can find the following files in the current R working directory.

JCMB_2015.csv
JCMB_2015_Apr.csv
JCMB_2015_Aug.csv
JCMB_2015_Dec.csv
JCMB_2015_Feb.csv
JCMB_2015_Jan.csv
JCMB_2015_Jul.csv
JCMB_2015_Jun.csv
JCMB_2015_Mar.csv
JCMB_2015_May.csv
JCMB_2015_Nov.csv
JCMB_2015_Oct.csv
JCMB_2015_Sep.csv
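The link filtering and URL building can be tried without any network access or extra packages. In the sketch below, the links vector is a hypothetical sample of what getHTMLLinks() might return, base R's grepl() and paste0() stand in for str_detect() and str_c(), and the actual download is switched off by default through the illustrative flag run_download.

```r
# Hypothetical sample of links, used in place of a live getHTMLLinks() call.
links <- c("JCMB_2014.csv", "JCMB_2015.csv", "JCMB_2015_Apr.csv",
           "index.html", "JCMB_2016_Jan.csv")
base_url <- "https://www.geos.ed.ac.uk/~weather/jcmb_ws/"

# Keep only the links that point to the 2015 files.
filenames <- links[grepl("JCMB_2015", links, fixed = TRUE)]

# Build the full URL for each file.
file_urls <- paste0(base_url, filenames)
print(file_urls)

# The download itself is off by default so the example runs offline.
run_download <- FALSE
if (run_download) {
  for (i in seq_along(file_urls)) {
    download.file(file_urls[i], destfile = filenames[i], mode = "wb")
  }
}
```

Setting run_download to TRUE would save the matching files into the current working directory, provided the URLs are reachable.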

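Databases

Data in relational database systems is stored in a normalized format. So, to carry out statistical computing we would need very advanced and complex SQL queries. But R can connect easily to many relational databases, such as MySQL, Oracle, and SQL Server, and fetch records from them as a data frame. Once the data is available in the R environment, it becomes a normal R data set and can be manipulated or analyzed using all of R's packages and functions.

In this tutorial we will use MySQL as the reference database for connecting to R.

The RMySQL package, which is installed from CRAN rather than shipped with R, provides native connectivity between R and a MySQL database. You can install it in the R environment using the following command.

install.packages("RMySQL")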

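Connecting to MySQL

Once the package is installed, we create a connection object in R to connect to the database. It takes the user name, password, database name, and host name as input.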

# Create a connection Object to MySQL database.
# We will connect to the sample database named "testdb" that comes with MySQL installation.
library("RMySQL")
conn = dbConnect(MySQL(), user = 'root', password = 'root', dbname = 'testdb', host = 'localhost', port=3306)

# Set the encoding method to gbk
dbSendQuery(conn, 'SET NAMES gbk')

# List the tables available in this database.
dbListTables(conn)

# When we execute the above code, it produces the following result:

[1] "employee_tbl"    "tbl"    "tcount_tbl"    "transaction_test"

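Querying Table Data

We can query a database table in MySQL using the function dbSendQuery(). The query is executed in MySQL, and the result set is retrieved with the R fetch() function. Finally, it is stored as a data frame in R.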

# Query the "tbl" tables to get all the rows.
result = dbSendQuery(conn, "SELECT * FROM tbl")

# Store the result in a R data frame object. n = 6 is used to fetch first 6 rows.
data.frame = fetch(result, n = 6)
print(data.frame)

# When we execute the above code, it produces the following result:

  id       title author submission_date
1  1    学习 PHP    PHP      2021-01-12
2  2  学习 MySQL  MySQL      2021-01-12
3  3    学习 C++    C++      2021-01-01
4  4 学习 Python Python      2021-01-01
5  5  MySQL 教程  MySQL      2021-01-12
6  6   JAVA 教程   JAVA      2021-01-12

Querying with a Filter Condition

We can pass any valid SELECT query to get the result.

# Query the "tbl" tables to get all the rows with author equal to 'MySQL'.
result = dbSendQuery(conn, "SELECT * FROM tbl WHERE author = 'MySQL'")

# Fetch all the records and store it as a data frame.
data.frame = fetch(result)
print(data.frame)

# When we execute the above code, it produces the following result:

  id      title author submission_date
1  2 学习 MySQL  MySQL      2021-01-12
2  5 MySQL 教程  MySQL      2021-01-12

Updating Table Data

We can update the rows in a MySQL table by passing an UPDATE query to the dbSendQuery() function.

# Query the "tbl" tables to get the rows with id equal to 1.
result = dbSendQuery(conn, "SELECT * FROM tbl WHERE id = 1")

# Fetch all the records and store it as a data frame.
data.frame = fetch(result)
print(data.frame)

# Update the record with id equal to 1.
dbSendQuery(conn, "UPDATE tbl SET submission_date = '2000-01-12' WHERE id = 1")

# Query the "tbl" tables to get the rows with id equal to 1.
result = dbSendQuery(conn, "SELECT * FROM tbl WHERE id = 1")

# Fetch all the records and store it as a data frame.
data.frame = fetch(result)
print(data.frame)

# After executing the above code, we can see the table updated in the MySQL environment.

  id    title author submission_date
1  1 学习 PHP    PHP      2021-01-12

  id    title author submission_date
1  1 学习 PHP    PHP      2000-01-12

Inserting Data into a Table

# Insert data to the "tbl" tables
dbSendQuery(conn,
   "INSERT INTO tbl (title, author, submission_date) VALUES('学习 R', 'R', '2021-01-22')"
)

# Query the "tbl" tables to get all the rows.
result = dbSendQuery(conn, "SELECT * FROM tbl")

# Fetch all the records and store it as a data frame.
data.frame = fetch(result)
print(data.frame)

# After executing the above code, we can see the row inserted into the table in the MySQL environment.

  id       title author submission_date
1  1    学习 PHP    PHP      2000-01-12
2  2  学习 MySQL  MySQL      2021-01-12
3  3    学习 C++    C++      2021-01-01
4  4 学习 Python Python      2021-01-01
5  5  MySQL 教程  MySQL      2021-01-12
6  6   JAVA 教程   JAVA      2021-01-12
7  7      学习 R      R      2021-01-22

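Creating a Table

We can create a table in MySQL using the function dbWriteTable(). It takes a data frame as input and, with overwrite = TRUE, overwrites the table if it already exists.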

# Create the connection object to the database where we want to create the table.
conn = dbConnect(MySQL(), user = 'root', password = 'root', dbname = 'testdb', host = 'localhost', port = 3306)

# Use the R data frame "mtcars" to create the table in MySQL.
# All the rows of mtcars are taken into MySQL.
dbWriteTable(conn, "mtcars", mtcars[, ], overwrite = TRUE)

# List the tables available in this database.
dbListTables(conn)

# Query the "mtcars" tables to get the rows
result = dbSendQuery(conn, "SELECT * FROM mtcars")

# Fetch all the records and store it as a data frame.
data.frame = fetch(result)
print(data.frame)

# After executing the above code, we can see the table created in the MySQL environment.

[1] TRUE

[1] "employee_tbl"     "mtcars"           "tbl"              "tcount_tbl"       "transaction_test"

             row_names  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1            Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
2        Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
3           Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
4       Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
5    Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
6              Valiant 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
7           Duster 360 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
8            Merc 240D 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
9             Merc 230 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
10            Merc 280 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
11           Merc 280C 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
12          Merc 450SE 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
13          Merc 450SL 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
14         Merc 450SLC 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
15  Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
16 Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
17   Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
18            Fiat 128 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
19         Honda Civic 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
20      Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
21       Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
22    Dodge Challenger 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
23         AMC Javelin 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
24          Camaro Z28 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
25    Pontiac Firebird 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
26           Fiat X1-9 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
27       Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
28        Lotus Europa 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
29      Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
30        Ferrari Dino 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
31       Maserati Bora 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
32          Volvo 142E 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Dropping a Table

We can drop a table in the MySQL database by passing a DROP TABLE statement to dbSendQuery(), in the same way we used it to query data from tables.

# List the tables available in this database.
dbListTables(conn)

# Delete the "mtcars" table.
dbSendQuery(conn, 'DROP TABLE IF EXISTS mtcars')

# List the tables available in this database.
dbListTables(conn)

# After executing the above code, we can see that the table has been dropped in the MySQL environment.

[1] "employee_tbl"     "mtcars"           "tbl"              "tcount_tbl"       "transaction_test"

[1] "employee_tbl"     "tbl"              "tcount_tbl"       "transaction_test"
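When you are done with a database session, it is good practice to release result sets and close the connection. The sketch below shows one way this might look using the generic DBI functions dbGetQuery(), dbExecute(), and dbDisconnect(). To stay runnable without a MySQL server, it assumes the DBI and RSQLite packages and an in-memory SQLite database as a stand-in; with RMySQL you would create conn with dbConnect(MySQL(), ...) as shown earlier. The code is skipped when those packages are not installed.

```r
# Assumption: an in-memory SQLite database stands in for the MySQL server.
if (requireNamespace("DBI", quietly = TRUE) &&
    requireNamespace("RSQLite", quietly = TRUE)) {
  conn <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

  # Create a small table from the first five rows of mtcars.
  DBI::dbWriteTable(conn, "mtcars", mtcars[1:5, c("cyl", "am", "gear")])

  # dbGetQuery() sends the query, fetches all rows, and clears the result set.
  rows <- DBI::dbGetQuery(conn, "SELECT cyl, gear FROM mtcars WHERE cyl = 6")
  print(rows)

  # dbExecute() runs a statement that returns no rows.
  DBI::dbExecute(conn, "DROP TABLE IF EXISTS mtcars")
  tables_left <- DBI::dbListTables(conn)
  print(tables_left)

  # Always close the connection when finished.
  DBI::dbDisconnect(conn)
}
```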