STAT 7008 - Assignment 3
Note: A3 counts for 20% of the overall assessment. The 100 points in A3 will be rescaled to
20% of the final score.
Web Scraping
1. (25 points) Crawl information from https://www.sciencedirect.com
(1) (13 points) Crawl the key information for all articles published in 2022 from the
website https://www.sciencedirect.com/journal/journal-of-econometrics/issues, including
year, volume, article content, title, authors and pages. Crawl volumes 226 to 230 only.
(2) (6 points) Remove “\xa0” from volume_name and store the crawled data in a pandas
DataFrame.
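A minimal sketch of the cleaning step, run on made-up data rather than real crawl output. The records below and every column name except volume_name are hypothetical placeholders. Replacing “\xa0” (a non-breaking space) with an ordinary space is an assumption; use "" as the replacement if the character should be dropped outright.

```python
import pandas as pd

# Hypothetical records standing in for crawled results.
records = [
    {"year": 2022, "volume_name": "Volume\xa0226, Issue\xa01",
     "title": "Paper A", "authors": ["Ann Lee", "Bo Chen"], "pages": "1-20"},
    {"year": 2022, "volume_name": "Volume\xa0227, Issue\xa02",
     "title": "Paper B", "authors": ["Cy Diaz"], "pages": "21-40"},
]

# Build the DataFrame first, then clean the column in one vectorized call.
df = pd.DataFrame(records)

# regex=False treats "\xa0" as a literal character, not a pattern.
df["volume_name"] = df["volume_name"].str.replace("\xa0", " ", regex=False)

print(df[["volume_name", "title"]])
```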
(3) (6 points) Filter out the articles whose author value is null, then find the top 10
authors who published the most articles.
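One possible way to do the counting, sketched on made-up data. It assumes each row holds a list of author names in a column called authors. The column name, the list layout and the sample names are illustrative assumptions, not part of the assignment.

```python
import pandas as pd

# Hypothetical data: one row per article, authors stored as a list (or None).
df = pd.DataFrame({
    "title": ["Paper A", "Paper B", "Paper C", "Paper D"],
    "authors": [
        ["Ann Lee", "Bo Chen"],
        None,
        ["Ann Lee"],
        ["Bo Chen", "Cy Diaz", "Ann Lee"],
    ],
})

# Drop articles with no author information.
with_authors = df.dropna(subset=["authors"])

# Expand each author list so every (article, author) pair gets its own row.
per_author = with_authors.explode("authors")

# Count articles per author and keep the 10 most frequent.
top_authors = per_author["authors"].value_counts().head(10)

print(top_authors)
```

If the authors were instead stored as a single delimited string, the column would need to be split into lists before calling explode.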
Hint:
i. Click the button for the target item
ii. Pass the HTML to BeautifulSoup and get all links
iii. Use requests to get the article content, title, authors and pages for each block
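Hint ii can be sketched as follows on a small hard-coded HTML string. The markup and link paths are hypothetical; the real page structure must be inspected in the browser. The volume filter uses an assumed "/vol/<number>/" pattern in the link.

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.sciencedirect.com"

# Hypothetical markup standing in for the HTML obtained in hint i.
html = """
<div>
  <a href="/journal/journal-of-econometrics/vol/226/issue/1">Volume 226, Issue 1</a>
  <a href="/journal/journal-of-econometrics/vol/230/issue/2">Volume 230, Issue 2</a>
  <a href="/journal/journal-of-econometrics/vol/231/issue/1">Volume 231, Issue 1</a>
  <a>Anchor without an href</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

links = []
# href=True skips anchors that carry no href attribute.
for anchor in soup.find_all("a", href=True):
    match = re.search(r"/vol/(\d+)/", anchor["href"])
    # Keep only volumes 226 to 230, as the task requires.
    if match and 226 <= int(match.group(1)) <= 230:
        # Turn the relative path into an absolute URL.
        links.append(urljoin(BASE_URL, anchor["href"]))

print(links)
```

Each URL in links could then be fetched with requests.get(url) and the response text parsed the same way to pull out the article fields (hint iii).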