
Commit 16138ac

authored Oct 1, 2020
Add files via upload
1 parent b2b642e commit 16138ac

File tree

2 files changed: +590 -0 lines changed

 

07-lab.Rmd

+225
---
title: "Lab 07 - Web scraping and Regular Expressions"
output: github_document
---

```{r setup}
knitr::opts_chunk$set(include = TRUE)
```

# Learning goals

- Use a real-world API to make queries and process the data.
- Use regular expressions to parse the information.
- Practice your GitHub skills.

# Lab description

In this lab, we will be working with the [NCBI API](https://www.ncbi.nlm.nih.gov/home/develop/api/)
to make queries and extract information using XML and regular expressions. For this lab, we will
be using the `httr`, `xml2`, and `stringr` R packages.

This markdown document should be rendered using the `github_document` output format.

## Question 1: How many SARS-CoV-2 papers?

Build an automatic counter of SARS-CoV-2 papers using PubMed. You will need to apply XPath, as we did during the lecture, to extract the number of results returned by PubMed at the following web address:

```
https://pubmed.ncbi.nlm.nih.gov/?term=sars-cov-2
```

Complete the lines of code:

```{r counter-pubmed, eval=TRUE, cache=TRUE}
# Downloading the website
website <- xml2::read_html("https://pubmed.ncbi.nlm.nih.gov/?term=sars-cov-2")
# Finding the counts
counts <- xml2::xml_find_first(website, "/html/body/main/div[9]/div[2]/div[2]/div[1]/span")
# Turning it into text
counts <- as.character(counts)
# Extracting the data using regex
stringr::str_extract(counts, "[0-9,]+")
```

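If you also want the count as a number, here is a minimal follow-up sketch (assuming the extraction above succeeded):

```{r counter-as-number, eval = TRUE}
# Drop the thousands separator and coerce; NA would mean the XPath match failed.
count_txt <- stringr::str_extract(counts, "[0-9,]+")
as.integer(stringr::str_remove_all(count_txt, ","))
```
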
Don't forget to commit your work!

## Question 2: Academic publications on COVID-19 and Hawaii

You need to query the NCBI E-utilities (Entrez) API. The parameters passed to the query are documented [here](https://www.ncbi.nlm.nih.gov/books/NBK25499/).

Use the function `httr::GET()` to make the following query:

1. Baseline URL: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi

2. Query parameters:

   - db: pubmed
   - term: covid19 hawaii
   - retmax: 1000

```{r papers-covid-hawaii, eval=TRUE}
library(httr)

query_ids <- GET(
  url   = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
  query = list(
    db     = "pubmed",
    term   = "covid19 hawaii",
    retmax = 1000
  )
)
# Extracting the content of the response of GET
ids <- httr::content(query_ids)
```

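Before parsing the response, it does not hurt to confirm that the request actually succeeded. A quick sanity-check sketch:

```{r check-status, eval = TRUE}
# stop_for_status() throws an error for any HTTP status >= 400;
# status_code() should be 200 for a successful query.
httr::stop_for_status(query_ids)
httr::status_code(query_ids)
```
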
The query will return an XML object; we can turn it into a character string and
analyze the text directly with `as.character()`. Another way of processing the
data is to use lists via the function `xml2::as_list()`. We will skip the
latter for now.

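For reference, a minimal sketch of that list-based alternative, assuming the standard eSearch response layout (`eSearchResult` > `IdList` > `Id`):

```{r as-list-sketch, eval = FALSE}
# Not required for this lab: as_list() converts the XML document into nested
# R lists, so the IDs can be pulled out without regular expressions.
ids_list <- xml2::as_list(ids)
unlist(ids_list$eSearchResult$IdList)
```
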
Take a look at the data, and continue with the next question (don't forget to
commit and push your results to your GitHub repo!).

## Question 3: Get details about the articles

The IDs are wrapped in text in the following way: `<Id>... id number ...</Id>`.
We can use a regular expression to extract that information. Fill out the
following lines of code:

```{r get-ids, eval = TRUE}
# Turn the result into a character vector
ids <- as.character(ids)
# Find all the ids
ids <- stringr::str_extract_all(ids, "<Id>[0-9]+</Id>")[[1]]
# Remove all the leading and trailing <Id> </Id>. Make use of "|"
ids <- stringr::str_remove_all(ids, "<Id>|</Id>")
```

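As an aside, the extract-and-strip steps can be collapsed into one using a capture group. A self-contained toy sketch with `stringr::str_match_all()`:

```{r capture-group-toy, eval = TRUE}
# Column 1 of the result holds the full match, column 2 only the captured digits.
toy <- "<Id>123</Id><Id>456</Id>"
stringr::str_match_all(toy, "<Id>([0-9]+)</Id>")[[1]][, 2]
```
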
With the IDs in hand, we can now try to get the abstracts of the papers. As
before, we will make the query and then coerce the contents (results) to a
character string, using:

1. Baseline URL: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi

2. Query parameters:

   - db: pubmed
   - id: A character with all the IDs separated by commas, e.g., "1232131,546464,13131"
   - retmax: 1000
   - rettype: abstract

**Pro-tip**: If you want `GET()` to take some element literally, wrap it in `I()` (as you would do in a formula in R). For example, the text `"123,456"` is normally URL-encoded and replaced with `"123%2C456"`. If you don't want that behavior, write `I("123,456")`.

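To see what the pro-tip means, a small illustration with `httr::modify_url()` (the URL here is just a placeholder):

```{r i-escape-demo, eval = TRUE}
# Without I(), the comma is percent-encoded as %2C in the query string.
httr::modify_url("https://example.com", query = list(id = "123,456"))
# With I(), the value is passed through literally.
httr::modify_url("https://example.com", query = list(id = I("123,456")))
```
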
```{r get-abstracts, eval = TRUE}
publications <- GET(
  url   = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi",
  query = list(
    db      = "pubmed",
    id      = paste(ids, collapse = ","),
    retmax  = 1000,
    rettype = "abstract"
  )
)
# Turning the output into a character vector
publications <- httr::content(publications)
publications_txt <- as.character(publications)
```

With this in hand, we can now analyze the data. This is also a good time for committing and pushing your work!

## Question 4: Distribution of universities, schools, and departments

Using the function `stringr::str_extract_all()` applied on `publications_txt`, capture all the terms of the form:

1. University of ...
2. ... Institute of ...

Write a regular expression that captures all such instances:

```{r univ-institute-regex, eval = TRUE}
library(stringr)
institution <- str_extract_all(
  publications_txt,
  "University of [[:alpha:]]+|[[:alpha:]]+ Institute of [[:alpha:]]+"
)
institution <- unlist(institution)
table(institution)
```

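The raw table can be long, so here is an optional viewing aid showing the most frequent matches first:

```{r top-institutions, eval = TRUE}
# Sort the counts in decreasing order and keep the ten most common.
head(sort(table(institution), decreasing = TRUE), 10)
```
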
Repeat the exercise, this time focusing on schools and departments of the form:

1. School of ...
2. Department of ...

And tabulate the results:

```{r school-department, eval = TRUE}
schools_and_deps <- str_extract_all(
  publications_txt,
  "School of\\s[[:alpha:]]+|Department of\\s[[:alpha:]]+"
)
# Flatten the list before tabulating, as we did above
schools_and_deps <- unlist(schools_and_deps)
table(schools_and_deps)
```

## Question 5: Form a database

We want to build a dataset which includes the title and the abstract of each
paper. The title of all records is enclosed by the XML tag `ArticleTitle`, and
the abstract by `Abstract`.

Before applying the functions to extract text directly, it will help to process
the XML a bit. We will use the `xml2::xml_children()` function to keep one element
per ID. This way, if a paper is missing the abstract, or something else, we will
still be able to properly match PubMed IDs with their corresponding records.

```{r one-string-per-response, eval = TRUE}
pub_char_list <- xml2::xml_children(publications)
pub_char_list <- sapply(pub_char_list, as.character)
```

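A quick consistency check (a sketch): there should be one element per PubMed ID.

```{r check-children, eval = TRUE}
# TRUE if each record was kept as its own element
length(pub_char_list) == length(ids)
```
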
Now, extract the abstract and article title for each one of the elements of
`pub_char_list`. You can either use `sapply()` as we just did, or simply
take advantage of the vectorization of `stringr::str_extract()`:

```{r extracting-last-bit, eval = TRUE}
abstracts <- str_extract(pub_char_list, "<Abstract>(\\n|.)+</Abstract>")
abstracts <- str_remove_all(abstracts, "</?[[:alnum:]]+>")
abstracts <- str_replace_all(abstracts, "\\s+", " ")
```

How many of these don't have an abstract?

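A one-line check (a sketch):

```{r count-missing-abstracts, eval = TRUE}
# NA means str_extract() found no <Abstract> tag for that record.
sum(is.na(abstracts))
```
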
Now, extract the titles:

```{r process-titles, eval = TRUE}
titles <- str_extract(pub_char_list, "<ArticleTitle>(\\n|.)+</ArticleTitle>")
titles <- str_remove_all(titles, "</?[[:alnum:]]+>")
titles <- str_replace_all(titles, "\\s+", " ")
```

Finally, put everything together into a single `data.frame` and use
`knitr::kable()` to print the results:

```{r build-db, eval = TRUE}
database <- data.frame(
  PubMedID  = ids,
  Title     = titles,
  Abstracts = abstracts
)
knitr::kable(database)
```

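Optionally, you can save the table for later use; a sketch (the file name is just an example):

```{r save-db, eval = FALSE}
# Write the database to a CSV file in the project directory.
write.csv(database, "covid19-hawaii-papers.csv", row.names = FALSE)
```
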
Done! Knit the document, commit, and push.

## Final Pro Tip (optional)

You can still share the HTML document on GitHub. You can include a link in your `README.md` file like the following:

```md
View [here](https://ghcdn.rawgit.org/:user/:repo/:tag/:file)
```

For example, if we wanted to add a direct link to the HTML page of lecture 7, we could do something like the following:

```md
View [here](https://ghcdn.rawgit.org/USCbiostats/PM566/master/static/slides/07-apis-regex/slides.html)
```

07-lab.md

+365
Large diffs are not rendered by default.

2 commit comments

Weijia-H (Owner, Author) commented on Oct 1, 2020

gvegayon commented on Oct 5, 2020


Good job, @Weijia-H! I only would recommend making more commits in between. It is always a good practice to do so :)
