class: center, middle, inverse, title-slide

# R bootcamp for lexical semanticists
## 5: final frontier
### Thomas Van Hoey 司馬智
### 5 December 2019
National Taiwan University

---
class: inverse, center, middle, clear

# Recap of last last week

---

# What we did last time

## Skills

* ✔️ scraping stuff from static websites with .orange[rvest]
* ✔️ the .orange[quanteda] package .grey[(easier corpus work, but we needed .orange[tidytext] first)]
* ❌ looking at some .blue[.xml files]

## Datasets

* ✔️ a .blue[website]
* ✔️ .blue[PTT] .blue[(this is your homework)]
* ❌ (a part of the) .blue[BNC corpus (in xml)]

## Case studies

Getting a bunch of data from the internet, and showing how you can start exploring it (before you analyze it yourself).

---

# This time

This will probably be the last session in the R bootcamp for lexical semanticists.

## Skills

* xml files
* segmentation of Chinese (.orange[jiebaR]) + how I do it
* stop words

## Data

* BNC

## Theory

* Association measures à la Gries

---
class: inverse, center, middle, clear

# Setup of today

---

# Setup

## Make a new markdown document in your project

Call it "jieba.Rmd".

## Packages

```r
library(tidyverse)
library(here)
```

```r
library(quanteda)
library(jiebaR) # install.packages("jiebaR") # segmenting Chinese
library(tmcn)   # stopwords and other functions for Chinese
```

---

# Segmenting Chinese

In the past we have seen that .orange[tidytext] and .orange[quanteda] use the same library underneath to make some segmentation guesses for Chinese. .orange[quanteda] is a bit more powerful, because you can glue words back together if they have been split.

However, what most people recommend for .blue[basic] segmentation is called .blue[jieba]. It comes with a [manual in Chinese](https://qinwenfeng.com/jiebaR/), so check that out later. Jieba is originally a Python package, but the R variant is available and called .orange[jiebaR].

---

# jiebaR

The basic thing you need to do is define a .red[worker()], and then you can .red[segment()] stuff.

```r
cutter <- worker()

segment("你好最近還可以嗎", cutter)
```

---

# jiebaR

Let's cut a table as well:

```r
cutter$bylines = TRUE
cutter

tribble(
  ~a,
  "你好最近還可以嗎",
  "台灣大學現在很冷"
) %>%
  mutate(a = segment(a, cutter)) %>%
  unnest(a)
```

The resulting view should be familiar from .orange[tidytext].

---

# jiebaR

But you are probably most interested in how to segment a bunch of files, right? As always, we **first do 1 file**.

```r
file <- read_lines(here("inputfiles", "周杰倫_晴天.txt"))

cutter$bylines = TRUE

segment(file, cutter) -> segmented_tokens

segmented_tokens %>%
  enframe() %>%
  unnest(value) %>%   # this is the unnest_tokens view
  ungroup() %>%       # this is weird but first ungroup then group_by
  group_by(name) %>%
  summarise(sent = str_c(value, collapse = " ", sep = "")) %>%
  pull(sent) %>%
  str_squish() -> segmented_spaces

segmented_spaces %>%
  write_lines(here("output", "周杰倫_晴天.txt"))
```

---

# jiebaR

Now we wrap that into a function, just as we have done in the past few weeks:

```r
tokens_to_spaces <- function(segmented_tokens){

  segmented_spaces <- segmented_tokens %>%
    enframe() %>%
    unnest(value) %>%   # this is the unnest_tokens view
    ungroup() %>%       # this is weird but first ungroup then group_by
    group_by(name) %>%
    summarise(sent = str_c(value, collapse = " ", sep = "")) %>%
    pull(sent) %>%
    str_squish()

  segmented_spaces
}

tokens_to_spaces(segmented_tokens = segmented_tokens)
```

---

# jiebaR

## more

It should be the same. You can call this function with the familiar steps:

1. list all the participating files with .orange[fs::].red[dir_ls], with the options .blue[recurse = TRUE, regexp = ".txt$"]
2. iterate with the .orange[purrr::].red[map/walk] family of functions (see the sketch below).
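
A minimal sketch of those two steps (the `segment_and_save()` helper is just an illustrative name; it assumes the `inputfiles`/`output` folders and the `cutter` and `tokens_to_spaces()` objects defined above):

```r
# A rough sketch, not a tested pipeline: segment_and_save() is a made-up
# helper; it assumes the inputfiles/output folders used earlier and the
# cutter (worker with bylines = TRUE) and tokens_to_spaces() defined above.
txt_files <- fs::dir_ls(here("inputfiles"), recurse = TRUE, regexp = ".txt$")

segment_and_save <- function(path) {
  read_lines(path) %>%
    segment(cutter) %>%        # one list element per line of the file
    tokens_to_spaces() %>%     # back to one space-delimited string per line
    write_lines(here("output", basename(path)))
}

purrr::walk(txt_files, segment_and_save)
```
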
## dictionary

There should be an option for custom dictionaries, which can help you segment better (e.g. you upload a list of things you definitely want segmented in a certain way). But I don't understand the manual. Maybe you do (it's in Chinese). If you manage to make it work, please tell me how: http://qinwenfeng.com/jiebaR/

---

# How I segment

I wrote a whole [blogpost](https://www.thomasvanhoey.com/post/guanguan-goes-the-chinese-word-segmentation-2/) about my troubles and successes with segmentation.

Currently I use a combination of R and Python, through the .orange[reticulate] package and the Python .orange[ckiptagger] library. You can find a demonstration [here](https://github.com/ckiplab/ckiptagger). If you are interested in this, I can demonstrate it at some point.

## Conclusion

So for now, the best way is either to segment your files like this (with .orange[jiebaR]), or to trust .orange[tidytext] or .orange[quanteda] directly. There is no 100% guaranteed word segmentation tool, as far as I know. But last week Hsieh laoshi said in his talk that we should not be looking for "the gold standard" anymore, but rather focus on levels of granularity. And I think that pragmatic mindset is good: computational stuff can help you a lot, but it can't do everything.

---
class: inverse, center, middle, clear

# Stop words

---

# Stop words

Let's make a resource that will be useful for us later: a list of [stop words](https://en.wikipedia.org/wiki/Stop_words) in Chinese. Words like {我, 與, 是, ...} are things we would like to list, so we can focus more on content words. (Although I think that is a naive way of doing linguistics, it has its place in things like wordclouds etc.)

---

# Lists of stopwords

## Quanteda

The first list comes from the .orange[quanteda] package, which we loaded above.

```r
ch_stop <- stopwords("zh", source = "misc") %>%
  enframe(name = NULL)

head(ch_stop)
nrow(ch_stop)
# ch_stop %>% distinct(value)
```

## tmcn

The second comes from the .orange[tmcn] package.

```r
data("STOPWORDS")

tmcn_stop <- stopwordsCN() %>%
  enframe(name = NULL)

head(tmcn_stop)
nrow(tmcn_stop)
```

---

# Lists of stopwords

## Combining them

```r
full_join(ch_stop, tmcn_stop, by = "value") %>%
  distinct(value) -> stopwords

stopwords

stopwords %>%
  rename(simp = value) %>%
  mutate(trad = tmcn::toTrad(simp)) -> stopwords # traditional characters
```

So now we have a nice set of stopwords. You can save it somewhere for future use :).

```r
write_csv(stopwords, here("stopwords.csv"))
```

---
class: inverse, center, middle, clear

# Collexemes and collostructions

---

# Papers by Gries and friends

.font70[

Stefanowitsch, Anatol & Stefan Th. Gries. 2003. Collostructions: Investigating the interaction of words and constructions. International Journal of Corpus Linguistics 8(2). 209–243. doi:10.1075/ijcl.8.2.03ste.

Gries, Stefan Th. & Anatol Stefanowitsch. 2004. Covarying collexemes in the into-causative.

Stefanowitsch, Anatol. 2005. New York, Dayton (Ohio), and the raw frequency fallacy. Corpus Linguistics and Linguistic Theory 1(2). 295–301.

Colleman, Timothy. 2009. The semantic range of the Dutch double object construction: A collostructional perspective. Constructions and Frames 1(2). 190–220. doi:10.1075/cf.1.2.02col.

Schmid, Hans-Jörg. 2010. Does frequency in text instantiate entrenchment in the cognitive system?

Gries, Stefan Th. 2012. Frequencies, probabilities, and association measures in usage-/exemplar-based linguistics. Studies in Language 36(3). 477–510.
Hilpert, Martin. 2012. Diachronic collostructional analysis: How to use it and how to deal with confounding factors.

Schmid, Hans-Jörg & Helmut Küchenhoff. 2013. Collostructional analysis and other ways of measuring lexicogrammatical attraction: Theoretical premises, practical problems and cognitive underpinnings. Cognitive Linguistics 24(3). 531–577. doi:10.1515/cog-2013-0018.

Gries, Stefan Th. 2015. More (old and new) misunderstandings of collostructional analysis: On Schmid and Küchenhoff (2013). Cognitive Linguistics 26(3). 505–536. doi:10.1515/cog-2014-0092.

Levshina, Natalia. 2015. How to do linguistics with R: Data exploration and statistical analysis.

Gries, Stefan Th. 2019. 15 years of collostructions: Some long overdue additions/corrections (to/of actually all sorts of corpus-linguistics measures). International Journal of Corpus Linguistics 24(3). 385–412. doi:10.1075/ijcl.00011.gri.

]

---

# Collostructions and collexeme analysis

Here are the basic ideas. Raw absolute token frequencies give you *an impression* of which collocations are important. BUT to have a real analysis, you need to turn them into relative frequencies: who is to say that very frequent elements (the verbs TO BE, TO HAVE, A/AN, THE, ...) occur a lot in your sample simply because they occur a lot overall, rather than because they are .blue[significant] for the construction you are looking at?

---

# Collostructions and collexeme analysis

Example: .red[N waiting to happen]

So you need to calculate somehow how significant the N in .red[N waiting to happen] is.

---

# The methodology

1. Get a corpus
2. Count all instances of the phenomenon you want to investigate
3. Count all instances of categories that participate in your phenomenon
4. Do letter math with your contingency table (choose a small set at first)
5. Perform statistical tests

So you want to make a contingency table like this:

|   | Element 2 | Not element 2 / other elements | Sum |
|---|---|---|---|
| **Element 1** | a | b | a + b |
| **Not element 1 / other elements** | c | d | c + d |
| **Sum** | a + c | b + d | a + b + c + d |

---

# The methodology

An example from Stefanowitsch & Gries (2003):

|   | accident | not accident | Sum |
|---|---|---|---|
| N waiting to happen | 14 | 21 | 35 |
| not N waiting to happen | 8,606 | 10,197,659 | 10,206,265 |
| **Sum** | 8,620 | 10,197,680 | 10,206,300 |

---

# Statistical tests

Levshina (2015) provides this nice list of calculations we can do (see also my [blog on tidy collostructions](https://www.thomasvanhoey.com/post/tidy-collostructions/) -- basically what we are doing today).

---

# Some more methods

There are basically three big approaches within collostructional analysis:

1. .blue[collexeme analysis] (Stefanowitsch & Gries, 2003), whose purpose is to quantify how much words that occur in a syntactically defined slot of a construction are attracted to or repelled by that construction; examples include the verbs that occur in the ditransitive or the imperative, or the nouns that occur in the .red[N waiting-to-happen construction]

2. .blue[distinctive collexeme analysis] (Gries & Stefanowitsch, 2004a), whose purpose is to quantify how much words prefer to occur in slots of two functionally similar constructions; examples include the .red[ditransitive vs. the prepositional dative] or the .red[will vs. the going-to future]
3. .blue[co-varying collexeme analysis] (Gries & Stefanowitsch, 2004b), whose purpose is to quantify how much words in one slot of a construction are attracted to or repelled by words in a second slot of the same construction; examples include the two verb slots in the .red[into-causative] .grey[(trickV1 someone into buyingV2 or forceV1 someone into acceptingV2)] or the verb and the preposition of the .red[way-construction] .grey[(weaveV your way throughPrep the crowd or makeV your way toPrep the top)].

The last one, the .red[into-causative], brings us to the case study of today's R tutorial. It will be a bit more theoretical in the beginning, because it takes a while to run on the whole BNC.

---

# The into causative

We are going to recreate the case study of the *into*-causative from Stefanowitsch & Gries (2003:224–227).

.red[
$$S_{agent} \; V \; O_{patient/agent} \; into \; A_{gerund}^{resulting \; action}$$
]

Examples:

a. He tricked me into employing him.

b. They were forced into formulating an opinion.

c. We conned a grown-up into buying the tickets.

---

# The issue

Hunston & Francis (2000) had provided some .blue[impressionistic] material based on raw frequency data, leading them to conclude that there is

> a strong tendency of the construction to occur with verbs denoting negative emotions (e.g. frighten, intimidate, panic, scare, terrify, embarrass, shock, shame etc.) or ways of speaking cleverly and deviously (e.g. talk, coax, cajole, charm, browbeat etc.).

They propose that verbs entering into the into-causative usually (i) do not mean ‘talk reasonably’ and (ii) can also be used transitively; they go on to argue that both of the senses they have identified are associated with “some kind of forcefulness or even coercion” (Hunston & Francis 2000:106).

Furthermore:

1. they only discuss the V slot (and we will too today)
2. the verbs *force* and *coerce* are absent from their discussion, even though they talk about it.

---

# What we should find

---

# Plan of attack

0. Make a new Rmarkdown, 'gries.Rmd'
1. I show how I found all instances of the V in the .red[V into V-ing construction]
2. I show how I found all verbs in the British National Corpus
3. We load in the files I created and perform the association tests
4. Short discussion

---

# New Rmarkdown

It's called 'gries.Rmd'. Restart the R session.

```r
library(tidyverse)
library(here)
library(xml2) # install.packages("xml2")
```

---

# xml

A short demonstration of xml, which works the same way as the html documents from last time (there is a big tree and you need to find the right nodes).

Basically, XML is hell. But we will open a file and eventually turn it into a flat string, so we can use our regular expression knowledge to get stuff. It is not beautiful and very painful, but you can get stuff from it.

```r
# This is the path to where I parked my BNC file
files <- "/Users/Thomas/Desktop/BNC/A/A0/A00.xml"

xmlread <- read_xml(files) # read in xml
xmlread

xmlsentence <- xml_find_all(xmlread, "//s") # find all sentences
xmlsentence

xml.sentencecharacter <- tolower(as.character(xmlsentence)) # turn every <s> node into a character string, then lower-case it
head(xml.sentencecharacter, 2)
```

---

# Getting what we want

For a corpus like the BNC, the tagset is your best friend: http://www.natcorp.ox.ac.uk/docs/c5spec.html

Also important is .blue[[^<]], a character class that acts the way .blue[.] usually does, but with a little more restriction: it cannot run into the next tag.
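
A quick toy illustration (a made-up string, not a real BNC line) of the difference:

```r
# Toy example (made-up string, not from the BNC) showing why [^<] is safer
# than . here: the greedy . runs across tag boundaries, [^<] cannot.
toy <- '<w pos="verb">tricked </w><w pos="prep">into </w>'

str_extract(toy, "<w.+>")     # matches the whole string, across both tags
str_extract(toy, "<w[^<]+>")  # stays inside the first <w ...> tag
```
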
So this is what I came up with:

```r
VERB <- "<w[^<]+pos=\"verb\"[^<]+</w>"
INTO <- "<w[^<]+pos=\"prep\">into </w>"
VERBGERUND <- "<w c5=\"v[bdhv]g\"[^<]+</w>" # all the different ing forms
INBETWEEN <- "<[^<]+<[^<]+" # this is one tag!
SPACE <- "[^<]"

search.expression <- glue::glue("{VERB}(\n )?{INTO}")
search.expression

se <- glue::glue("{VERB}{SPACE}+({INBETWEEN})+{INTO}{SPACE}+{VERBGERUND}")
se
```

---

# Iterating over the whole BNC

```r
collexeme.preparer <- function(files) {

  # this we just did
  number <- basename(files)
  xmlread <- read_xml(files)                  # read in xml
  xmlsentence <- xml_find_all(xmlread, "//s") # find all sentences
  xml.sentencecharacter <- tolower(as.character(xmlsentence)) # turn every <s> into a character string and lower-case it

  # this is new
  extracted.modal.verb <- xml.sentencecharacter %>%
    str_extract(se) %>% # getting the V into V-ing hits
    tibble(.name_repair = ~ "strings") %>%
    mutate(name = number) %>%
    drop_na(strings)

  write_csv(extracted.modal.verb, here::here("into.csv"), append = TRUE)
}
```

---

# Iterating over the whole BNC

```r
filelist <- fs::dir_ls(here::here("BNC"), recurse = TRUE, regexp = ".xml$")
length(filelist)

# this is where you really iterate
library(furrr) # future_walk() comes from the furrr package
future_walk(filelist, collexeme.preparer, .progress = TRUE)
```

I used the .orange[furrr] package, which uses the power of .orange[future] to make better use of your computer. Just google it if you want to do this yourself. But we won't run it today, because it takes a while.

---

# So we got the V into V-ing

We got the V into V-ing constructions. But you need more frequencies to fill up the contingency table.

|   | Element 2 | Not element 2 / other elements | Sum |
|---|---|---|---|
| **Element 1** | .blue[a] | b | a + b |
| **Not element 1 / other elements** | c | d | c + d |
| **Sum** | a + c | b + d | .blue[a + b + c + d] |

---

# Getting all verbs from the BNC

Basically, it is the same script:

```r
* VERB <- "<w[^<]+pos=\"verb\"[^<]+</w>"

collexeme.preparer <- function(files) {

  number <- basename(files)
  xmlread <- read_xml(files)                  # read in xml
  xmlsentence <- xml_find_all(xmlread, "//s") # find all sentences
  xml.sentencecharacter <- tolower(as.character(xmlsentence)) # turn every <s> into a character string and lower-case it

  extracted.modal.verb <- xml.sentencecharacter %>%
    str_extract(VERB) %>% # get every verb tag
    tibble(.name_repair = ~ "strings") %>%
    mutate(name = number) %>%
*   mutate(verb = str_extract(strings, '(?<=hw=")\\w+')) %>%
    drop_na(strings)

  write_csv(extracted.modal.verb, here::here("allverbsinBNC.csv"), append = TRUE)
}

future_walk(filelist, collexeme.preparer, .progress = TRUE)
```

---

# Some post processing

If you want to do this at home (but you can also just use the final file, see the next slide):

```r
# read back the csv we just wrote with collexeme.preparer()
freqlist <- read_csv(here::here("allverbsinBNC.csv"), col_names = FALSE)

freqlist %>%
  filter(str_detect(X1, 'pos="verb"')) -> firstfilter

# quick sanity check on one verb
firstfilter %>%
  filter(str_detect(X1, "mislead"))

firstfilter %>%
  mutate(verb = str_extract(X1, 'hw="\\w+'),
         verb = str_remove(verb, 'hw="')) %>%
  select(verb) %>%
  count(verb) %>%
  filter(n > 10) %>%
  rename(freqall = n) -> freqofverbs

freqofverbs

write_csv(freqofverbs, here("bncallverbs.csv"))
```

---

# Time for further analysis

Let's download the csv file with the frequencies of all verbs from my GitHub, where I parked it, and check it.
```r
freqofverbs <- read_csv("https://raw.githubusercontent.com/simazhi/sinologica/master/static/Rbootcamp/inputfiles/bncallverbs.csv") %>%
  mutate(freqall = as.integer(freqall))

freqofverbs %>%
  filter(verb == "trick")
```

We can already get .blue[a + b + c + d]!

```r
abcd <- freqofverbs %>%
  summarise(sum = sum(freqall)) %>%
  pull()

abcd
```

---

# The into frequencies

This one isn't so pretty:

```r
intos <- read_csv("https://raw.githubusercontent.com/simazhi/sinologica/master/static/Rbootcamp/inputfiles/into.csv",
                  col_names = FALSE)

intos %>%
  mutate(X1 = str_replace_all(X1, "\n", "@")) %>%        # hidden \n s
  mutate(last = str_extract(X1, '<.+verb.+verb.+$')) %>% # we want the end
  select(last) %>%
  mutate(
    # first make the gerund in a few steps
    gerund = str_extract(last, 'c5="v[bdhv]g.+$'),
    gerund = str_extract(gerund, '(?<=hw=")\\w+'),
    # make the V into
    verb = str_remove_all(last, 'into.+$'),
    verb = str_extract(verb, '<[^<]+pos="verb"(?!.+pos="verb".+).+</w>'),
    verb = str_extract(verb, '(?<=hw=")\\w+')) %>%
  select(verb, gerund) -> occurrences

occurrences
```

---

# Join the two tables

First counting, then joining.

```r
occurrences %>%
  count(verb, sort = TRUE) %>%
  left_join(freqofverbs, by = c("verb" = "verb"))
```

I noticed that there is something weird going on, because mislead occurs 42 times in the subset and 33 times in the whole set; trick 69 vs. 65 times. So the regular expression can probably be improved, and you should definitely find out what is wrong if this were a real study. But for today I am going to cheat a little bit: if freqall is smaller than n, I will just set freqall equal to n.

---

# Join the two tables

```r
occurrences %>%
  count(verb, sort = TRUE) %>%
  left_join(freqofverbs, by = c("verb" = "verb")) %>%
  mutate(freqall = case_when(
    n > freqall ~ n,
    TRUE ~ freqall
  )) -> finished.df

finished.df
```

Here you can see that *come* and *go* occur very often in this .red[V into V-ing] construction, e.g. *come into being*, *went into hiding*. But they might just occur a lot because they are frequent verbs. So that's why we need the association measures.

---

# Contingency table

Let's look at .blue[trick].

|   | *into*-causative | not *into*-causative | Sum |
|---|---|---|---|
| **trick** | a | b | a + b |
| **other verbs** | c | d | c + d |
| **Sum** | a + c | b + d | a + b + c + d |

---

# What do we know?

* a (trick into)
* a+b (freqall)
* a+b+c+d (the sum of freqall)

We can calculate the rest as follows, first shown for .blue[trick] and .blue[talk]:

```r
finished.df %>%
  filter(str_detect(verb, "trick|talk")) %>%
  rename(a = n, ab = freqall) %>%
  mutate(b = ab - a) %>%
  mutate(ac = sum(a)) %>%  # note: with the filter on, sum(a) only adds up trick + talk; the full table below uses all verbs
  mutate(c = ac - a) %>%
  mutate(abcd = abcd) %>%  # we know abcd from above
  mutate(cd = abcd - ab) %>%
  mutate(d = cd - c) %>%
  select(verb, a, b, c, d, abcd)
```

---

# Big table for all verbs

Now the same but without the filter:

```r
finished.df %>%
  #filter(str_detect(verb, "trick|talk")) %>%
  rename(a = n, ab = freqall) %>%
  mutate(b = ab - a) %>%
  mutate(ac = sum(a)) %>%
  mutate(c = ac - a) %>%
  mutate(abcd = abcd) %>%  # we know abcd from above
  mutate(cd = abcd - ab) %>%
  mutate(d = cd - c) %>%
  mutate(aexp = ab * ac / abcd) %>%
  select(verb, a, b, c, d, abcd, aexp) -> preassociation

preassociation
```

---

# Time for testing

As mentioned above, you can do many different tests -- it is up to the researcher to choose a good one. You can also do multiple ones at once, just by using .red[mutate()] and the .blue[letter math] formulas.
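
For instance, here is a sketch (not part of the original case study) that adds a few other letter-math measures in one go: attraction, reliance, and a smoothed log odds ratio.

```r
# A sketch (not part of the original analysis): a few more letter-math
# measures added via mutate(). The odds ratio is smoothed with 0.5 in each
# cell to avoid division by zero.
preassociation %>%
  mutate(
    attraction = 100 * a / (a + c),  # % of the construction's slot taken by this verb
    reliance   = 100 * a / (a + b),  # % of this verb's tokens that occur in the construction
    log_odds   = log(((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5)))
  ) %>%
  arrange(desc(log_odds))
```

The nice thing about the tidy layout is that every extra measure is just one more line in the .red[mutate()] call.
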
I chose cue validity because it is closely related to the one Stefanowitsch & Gries showed:

$$\Delta P_{word \to construction} = cue_{construction} = \frac{a}{a + c} - \frac{b}{b + d}$$

---

# Cue validity

Given a word, how strongly does it attract the construction?

```r
preassociation %>%
  mutate(dP.cueCx   = a/(a + c) - b/(b + d),
         dP.cueVerb = a/(a + b) - c/(c + d)) %>%
  arrange(desc(dP.cueVerb)) %>%
  top_n(20)
```

Then this is what you analyze with the methods we are used to (relating it to theoretical frameworks etc.).

Let's see what Stefanowitsch and Gries (2003) have to say:

* you can categorize the verbs into semantic groups
* you can see which verbs are attracted and which are repelled
* you can also integrate the V-ing into the analysis
* ...

---

# Steps taken

1. Get a corpus
2. Count all instances of the phenomenon you want to investigate (the into-construction)
3. Count all instances of categories that participate in your phenomenon (all verbs) + calculate the sum (abcd)
4. Do letter math with your tidy contingency table (choose a small set at first)
5. Perform statistical tests

It's simple in abstraction, but a bit annoying to actually do. Because XML is hell. But letter math is rad.

---
class: inverse, center, middle, clear

# Last words

---

# General conclusion

We have focused on some techniques to get data and clean it, so that we can begin our analysis. R won't do it for you, but I hope that this bootcamp has shown you how you can integrate a bit of R into your workflow.

The most important things are regular expressions and how to quickly clean your dataset. The tagging, unfortunately, is still best done by hand, but that is also a good thing, because it forces you to familiarize yourself with your data.

And remember three things:

1. Cheat sheets
2. Google is your friend
3. People from the computational lab are also your friends

---
class: inverse, center, middle, clear

# Good luck with R

---
background-image: url(https://media.giphy.com/media/upg0i1m4DLe5q/source.gif)
background-position: 50% 50%
background-size: 100%
class: center, bottom, clear