ざっくり:
install.packages("tidyverse")
でインストール── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
✓ ggplot2 3.3.3 ✓ purrr 0.3.4
✓ tibble 3.1.0 ✓ dplyr 1.0.3
✓ tidyr 1.1.2 ✓ stringr 1.4.0
✓ readr 1.4.0 ✓ forcats 0.5.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
読み込まれるパッケージ
setwd()
を書いているものもありますが、RStudioプロジェクトでは必要ありません。
read.csv()
stringsAsFactors = TRUE
がデフォルトになっているので、stringsAsFactors = FALSE
をつけることを推奨します。readr::read_csv()
data.table::fread()
readr::read_csv()
よりも高速data.table = FALSE
read.delim()
read.delim()
は区切り値のファイルを読む標準関数read.csv()
は sep = ","
をつけたものreadr::read_tsv()
data.table::fread()
read.delim()
readr::read_delim()
data.table::fread()
read_***()
関数が一番オススメread.*** | read_*** | fread | |
---|---|---|---|
速さ(45MB) | 3秒 | 0.8 秒 | 0.6秒 |
区切り値の判定ミス | × | × | △ |
エンコーディング | ○ | ○ | △ |
haven パッケージで読み込み
Encoding=cp932
Encoding=UTF8
パッケージを使わないやり方より
2つのデータ形式(例: カテゴリごとの購買金額(千円))
df <- tibble::tibble("country" = c("a", "b", "c"),
"1999" = c(0.7, 0.3, 1.0),
"2000" = c(1.0, 2.0, 4.8),
"2001" = c(2.0, 5.0, 7.0))
df
# A tibble: 3 x 4
country `1999` `2000` `2001`
<chr> <dbl> <dbl> <dbl>
1 a 0.7 1 2
2 b 0.3 2 5
3 c 1 4.8 7
# A tibble: 9 x 3
country year amount
<chr> <chr> <dbl>
1 a 1999 0.7
2 a 2000 1
3 a 2001 2
4 b 1999 0.3
5 b 2000 2
6 b 2001 5
7 c 1999 1
8 c 2000 4.8
9 c 2001 7
# A tibble: 3 x 4
country `1999` `2000` `2001`
<chr> <dbl> <dbl> <dbl>
1 a 0.7 1 2
2 b 0.3 2 5
3 c 1 4.8 7
dat_m <- tibble::tibble(user = c('A', 'B', 'C'),
category_1 = c(10, 15, 8),
category_2 = c(2, 4, 5),
subject_1 = c(4, 5, 6),
subject_2 = c(5, 6, 7))
dat_m
# A tibble: 3 x 5
user category_1 category_2 subject_1 subject_2
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 10 2 4 5
2 B 15 4 5 6
3 C 8 5 6 7
dat_long <- dat_m %>%
pivot_longer(cols = -user,
names_to = c("group", "num"),
names_sep = "_")
dat_long
# A tibble: 12 x 4
user group num value
<chr> <chr> <chr> <dbl>
1 A category 1 10
2 A category 2 2
3 A subject 1 4
4 A subject 2 5
5 B category 1 15
6 B category 2 4
7 B subject 1 5
8 B subject 2 6
9 C category 1 8
10 C category 2 5
11 C subject 1 6
12 C subject 2 7
Tokyo.R #79 の応用セッション を参照。
[1] 6
結果 <- スタート地点
を書いて、やりたい処理をパイプでつないでいくCtrl
+ Shift
+ M
Cmd
+ Shift
+ M
git clone https://github.com/ymattu/sampledata_small
例
select(product, 1:3) # 列番号が連続している場合
select(product, ProductID:Price) # 列名でも連続していれば同様
select(product, -CreatedDate) # 特定の列を除く
select(product, -4) # 特定の列番号を除く
select(product, starts_with("p")) # "p"で始まる列のみを選択
select(product, starts_with("p"), ignore.case = TRUE) # 大文字小文字を無視
select(product, matches("^(Product|Price)")) # "Product"または"Price"で始まる列を選択
$
はお手軽だしよく使います。$
で 1 列だけ取り出す[1] "雑貨・日用品" "花・グリーン" "食品"
[4] "衣料品" "ヘルス&ビューティー" "家具・インテリア・家電"
たくさんあるけど例えば
ユーザー、年ごとに集計
inner_join()
left_join()
full_join()
anti_join()
Answer : C++を使っているから
例えばこんな感じ(a, b, c 3 つのデータ)
x1 x2
1 A 1
2 B 2
3 C 3
x1 x3
1 A TRUE
2 B FALSE
3 D TRUE
x1 x4
1 B 10
2 C 11
3 D 12
Answer : 少し応用的ですが、purrrパッケージを使うと簡単です。
purrr パッケージの参考資料→そろそろ手を出す purrr
summarise()
の挙動変化rowwise()
関数についてacross()
, where()
)についてRows: 87
Columns: 14
$ name <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
$ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
$ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
$ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
$ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
$ eye_color <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
$ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
$ sex <chr> "male", "none", "none", "male", "female", "male", "female",…
$ gender <chr> "masculine", "masculine", "masculine", "masculine", "femini…
$ homeworld <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
$ species <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
$ films <list> [<"The Empire Strikes Back", "Revenge of the Sith", "Retur…
$ vehicles <list> [<"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Im…
$ starships <list> [<"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1"…
summarise()
の挙動変化についてsummarise()
のはたらきは?# A tibble: 6 x 14
# Groups: species, homeworld [4]
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth V… 202 136 none white yellow 41.9 male mascu…
5 Leia Or… 150 49 brown light brown 19 fema… femin…
6 Owen La… 178 120 brown, grey light blue 52 male mascu…
# … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
starwars_summary1 <- grouped_starwars %>%
summarise(mean_height = mean(height, na.rm = TRUE))
starwars_summary1
# A tibble: 58 x 3
# Groups: species [38]
species homeworld mean_height
<chr> <chr> <dbl>
1 Aleena Aleen Minor 79
2 Besalisk Ojom 198
3 Cerean Cerea 198
4 Chagrian Champala 196
5 Clawdite Zolan 168
6 Droid Naboo 96
7 Droid Tatooine 132
8 Droid <NA> 148
9 Dug Malastare 112
10 Ewok Endor 88
11 Geonosian Geonosis 183
12 Gungan Naboo 209.
13 Human Alderaan 176.
14 Human Bespin 175
15 Human Bestine IV 180
16 Human Chandrila 150
17 Human Concord Dawn 183
18 Human Corellia 175
19 Human Coruscant 168.
20 Human Eriadu 180
21 Human Haruun Kal 188
22 Human Kamino 183
23 Human Naboo 168.
24 Human Serenno 193
25 Human Socorro 177
26 Human Stewjon 182
27 Human Tatooine 179.
28 Human <NA> 193
29 Hutt Nal Hutta 175
30 Iktotchi Iktotch 188
31 Kaleesh Kalee 216
32 Kaminoan Kamino 221
33 Kel Dor Dorin 188
34 Mirialan Mirial 168
35 Mon Calamari Mon Cala 180
36 Muun Muunilinst 191
37 Nautolan Glee Anselm 196
38 Neimodian Cato Neimoidia 191
39 Pau'an Utapau 206
40 Quermian Quermia 264
41 Rodian Rodia 173
42 Skakoan Skako 193
43 Sullustan Sullust 160
44 Tholothian Coruscant 184
45 Togruta Shili 178
46 Toong Tund 163
47 Toydarian Toydaria 137
48 Trandoshan Trandosha 190
49 Twi'lek Ryloth 179
50 Vulptereen Vulpter 94
51 Wookiee Kashyyyk 231
52 Xexto Troiken 122
53 Yoda's species <NA> 66
54 Zabrak Dathomir 175
55 Zabrak Iridonia 171
56 <NA> Naboo 183
57 <NA> Umbara 178
58 <NA> <NA> NaN
mean()
では、グループ化された中での平均を計算してしまう# A tibble: 58 x 4
# Groups: species [38]
species homeworld mean_height rel_heigt
<chr> <chr> <dbl> <dbl>
1 Aleena Aleen Minor 79 1
2 Besalisk Ojom 198 1
3 Cerean Cerea 198 1
4 Chagrian Champala 196 1
5 Clawdite Zolan 168 1
6 Droid Naboo 96 0.766
7 Droid Tatooine 132 1.05
8 Droid <NA> 148 1.18
9 Dug Malastare 112 1
10 Ewok Endor 88 1
11 Geonosian Geonosis 183 1
12 Gungan Naboo 209. 1
13 Human Alderaan 176. 0.989
14 Human Bespin 175 0.982
15 Human Bestine IV 180 1.01
16 Human Chandrila 150 0.842
17 Human Concord Dawn 183 1.03
18 Human Corellia 175 0.982
19 Human Coruscant 168. 0.945
20 Human Eriadu 180 1.01
21 Human Haruun Kal 188 1.05
22 Human Kamino 183 1.03
23 Human Naboo 168. 0.945
24 Human Serenno 193 1.08
25 Human Socorro 177 0.993
26 Human Stewjon 182 1.02
27 Human Tatooine 179. 1.01
28 Human <NA> 193 1.08
29 Hutt Nal Hutta 175 1
30 Iktotchi Iktotch 188 1
31 Kaleesh Kalee 216 1
32 Kaminoan Kamino 221 1
33 Kel Dor Dorin 188 1
34 Mirialan Mirial 168 1
35 Mon Calamari Mon Cala 180 1
36 Muun Muunilinst 191 1
37 Nautolan Glee Anselm 196 1
38 Neimodian Cato Neimoidia 191 1
39 Pau'an Utapau 206 1
40 Quermian Quermia 264 1
41 Rodian Rodia 173 1
42 Skakoan Skako 193 1
43 Sullustan Sullust 160 1
44 Tholothian Coruscant 184 1
45 Togruta Shili 178 1
46 Toong Tund 163 1
47 Toydarian Toydaria 137 1
48 Trandoshan Trandosha 190 1
49 Twi'lek Ryloth 179 1
50 Vulptereen Vulpter 94 1
51 Wookiee Kashyyyk 231 1
52 Xexto Troiken 122 1
53 Yoda's species <NA> 66 1
54 Zabrak Dathomir 175 1.01
55 Zabrak Iridonia 171 0.988
56 <NA> Naboo 183 NaN
57 <NA> Umbara 178 NaN
58 <NA> <NA> NaN NaN
ungroup()
されたのが正しい結果# A tibble: 58 x 4
species homeworld mean_height rel_heigt
<chr> <chr> <dbl> <dbl>
1 Aleena Aleen Minor 79 NaN
2 Besalisk Ojom 198 NaN
3 Cerean Cerea 198 NaN
4 Chagrian Champala 196 NaN
5 Clawdite Zolan 168 NaN
6 Droid Naboo 96 NaN
7 Droid Tatooine 132 NaN
8 Droid <NA> 148 NaN
9 Dug Malastare 112 NaN
10 Ewok Endor 88 NaN
11 Geonosian Geonosis 183 NaN
12 Gungan Naboo 209. NaN
13 Human Alderaan 176. NaN
14 Human Bespin 175 NaN
15 Human Bestine IV 180 NaN
16 Human Chandrila 150 NaN
17 Human Concord Dawn 183 NaN
18 Human Corellia 175 NaN
19 Human Coruscant 168. NaN
20 Human Eriadu 180 NaN
21 Human Haruun Kal 188 NaN
22 Human Kamino 183 NaN
23 Human Naboo 168. NaN
24 Human Serenno 193 NaN
25 Human Socorro 177 NaN
26 Human Stewjon 182 NaN
27 Human Tatooine 179. NaN
28 Human <NA> 193 NaN
29 Hutt Nal Hutta 175 NaN
30 Iktotchi Iktotch 188 NaN
31 Kaleesh Kalee 216 NaN
32 Kaminoan Kamino 221 NaN
33 Kel Dor Dorin 188 NaN
34 Mirialan Mirial 168 NaN
35 Mon Calamari Mon Cala 180 NaN
36 Muun Muunilinst 191 NaN
37 Nautolan Glee Anselm 196 NaN
38 Neimodian Cato Neimoidia 191 NaN
39 Pau'an Utapau 206 NaN
40 Quermian Quermia 264 NaN
41 Rodian Rodia 173 NaN
42 Skakoan Skako 193 NaN
43 Sullustan Sullust 160 NaN
44 Tholothian Coruscant 184 NaN
45 Togruta Shili 178 NaN
46 Toong Tund 163 NaN
47 Toydarian Toydaria 137 NaN
48 Trandoshan Trandosha 190 NaN
49 Twi'lek Ryloth 179 NaN
50 Vulptereen Vulpter 94 NaN
51 Wookiee Kashyyyk 231 NaN
52 Xexto Troiken 122 NaN
53 Yoda's species <NA> 66 NaN
54 Zabrak Dathomir 175 NaN
55 Zabrak Iridonia 171 NaN
56 <NA> Naboo 183 NaN
57 <NA> Umbara 178 NaN
58 <NA> <NA> NaN NaN
summarise()
のはたらきは?grouped_starwars %>%
summarise(q_height = quantile(height, na.rm = TRUE))
#> Error: Column `dep_height` must be length 1 (a summary value), not 5
quantile()
は分位数を返す集約関数だが、出力は5つ
0% 25% 50% 75% 100%
66 167 180 191 264
summarise()
の新機能1summarise()
の新機能2summarise()
の新機能3grouped_starwars %>%
summarise(q_height = quantile(height, na.rm = TRUE))
# A tibble: 290 x 3
# Groups: species, homeworld [58]
# species homeworld dep_height
# <chr> <chr> <dbl>
# 1 Aleena Aleen Minor 79
# 2 Aleena Aleen Minor 79
# 3 Aleena Aleen Minor 79
# 4 Aleena Aleen Minor 79
# 5 Aleena Aleen Minor 79
# 6 Besalisk Ojom 198
# 7 Besalisk Ojom 198
# 8 Besalisk Ojom 198
# 9 Besalisk Ojom 198
# 10 Besalisk Ojom 198
# … with 280 more rows
grouped_starwars %>%
summarise(q = c(0, 0.25, 0.5, 0.75, 1),
q_height = quantile(height, na.rm = TRUE))
# A tibble: 290 x 4
# Groups: species, homeworld [58]
# species homeworld q q_height
# <chr> <chr> <dbl> <dbl>
# 1 Aleena Aleen Minor 0 79
# 2 Aleena Aleen Minor 0.25 79
# 3 Aleena Aleen Minor 0.5 79
# 4 Aleena Aleen Minor 0.75 79
# 5 Aleena Aleen Minor 1 79
# 6 Besalisk Ojom 0 198
# 7 Besalisk Ojom 0.25 198
# 8 Besalisk Ojom 0.5 198
# 9 Besalisk Ojom 0.75 198
# 10 Besalisk Ojom 1 198
# … with 280 more rows
across()
, where()
について***_at()
の例
starwars %>%
group_by(species) %>%
filter(n() > 1) %>%
summarise_at(vars(gender, homeworld),
n_distinct)
# A tibble: 9 x 3
species gender homeworld
<chr> <int> <int>
1 Droid 2 3
2 Gungan 1 1
3 Human 2 16
4 Kaminoan 2 1
5 Mirialan 1 1
6 Twi'lek 2 1
7 Wookiee 1 1
8 Zabrak 1 2
9 <NA> 1 3
***_at()
の他に、
***_all()
***_if()
across()
, where()
を使用***_at()
, `***_aif()
, ***_all()
がなくなるわけではない対象の列 | dplyr < 1.0.0 | dplyr >= 1.0.0 |
---|---|---|
列名を指定 | ***_at | ***(across(列の指定, 処理)) |
特定の型を持つ列 | ***_if | ***(across(where(列の型の指定), 処理)) |
全ての列 | ***_all | ***(across(everything(), 処理)) |
rowwise()
について x y
1 1 3
2 2 3
3 3 3
4 1 2
5 2 2
6 3 2
7 1 1
8 2 1
9 3 1
x y m
1 1 3 2
2 2 3 2
3 3 3 2
4 1 2 2
5 2 2 2
6 3 2 2
7 1 1 2
8 2 1 2
9 3 1 2
rowwise()
でこれを解消# A tibble: 9 x 3
# Rowwise:
x y m
<int> <int> <dbl>
1 1 3 2
2 2 3 2.5
3 3 3 3
4 1 2 1.5
5 2 2 2
6 3 2 2.5
7 1 1 1
8 2 1 1.5
9 3 1 2
rowwise()
の歴史