Scrape News website using Scrapy

How I scraped 5,50,000 Odia news articles


Abstract

To feed today's data-hungry NLP models, I scraped 5,50,000+ (over 550,000) news articles from Odia news websites, amounting to roughly 50 lakh (5 million) sentences and more than 2 crore (20 million) words of monolingual Odia corpus. Each record consists of a header and the corresponding article content. In this blog post, I describe why and how I did this, and how it can be improved further to extract more meaningful data.

Introduction

  • Odia, despite being a vocabulary-rich (ଶବ୍ଦକୋଷ ସମୃଦ୍ଧ) language, has very little presence on the internet to feed the data-hungry giants of recent NLP developments.
  • There has been a constant demand for an Odia corpus for these NLP tasks.
  • After demand from various forums, the news publishers of Odisha moved from uploading only e-papers (scans of the physical newspaper) to publishing actual news articles in Odia Unicode on their websites.
  • Almost all the major Odia news agencies have been posting articles on the internet since 2019.
  • Therefore, I believed that after two years there should be enough data for me to extract and build a large corpus (ଭଣ୍ଡାର).

Planning and analysis

Challenges

Reduce the housekeeping tasks

  • Earlier, I had written a parser to get data from Odia Wikipedia using the BeautifulSoup Python library.
  • However, that involved many housekeeping tasks, such as building your own POST requests and sleeping on failure or rate limiting, and it was not scalable.
  • I needed something I could develop easily, without many maintenance tasks to keep track of.

Make it beginner-friendly and as simple as possible

  • How can I leverage an existing open-source library or framework to make it easy for beginners?
  • The user should provide only the minimum inputs, such as where to crawl, what to scrape, and what to store, and the framework should handle the rest.

Possible Solutions

In the quest to resolve the above problems, I evaluated the following libraries:

  1. BeautifulSoup
  2. lxml
  3. Selenium
  4. Scrapy

Why did I choose Scrapy

Scrapy had been on my radar for at least a couple of years, but due to procrastination I had not been able to learn it. Finally, this November, I got my hands on it.

Pros of Scrapy

  • It is a full-fledged framework built for a single purpose, crawling websites and extracting data from them, and it does that well.
  • It has ready-made features, so there is no need to build them from scratch.
  • It supports rate limiting, proxy handling, user-agent configuration, and crawling and scraping HTML based on expressions such as XPath and CSS selectors.
  • It is fast and asynchronous, built on top of the Twisted networking framework.
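
As a rough illustration of the rate-limiting and user-agent support listed above, a project's settings.py might contain something like the following. The setting names are standard Scrapy settings, but every value here (and the user-agent string) is an assumption for illustration, not the configuration used for this crawl.

```python
# settings.py (sketch): politeness-related Scrapy settings.
# All values below are illustrative assumptions, not the author's configuration.

# Identify the crawler; the name and contact URL are hypothetical.
USER_AGENT = "odia-corpus-bot (+https://example.com/contact)"

# Respect the site's robots.txt rules.
ROBOTSTXT_OBEY = True

# Wait about one second between requests to the same site.
DOWNLOAD_DELAY = 1.0
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Let the AutoThrottle extension adapt the delay to the server's response times.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0

# Retry failed requests a few times instead of hand-writing sleep/retry loops.
RETRY_ENABLED = True
RETRY_TIMES = 3
```

With settings like these, the sleeping and retrying that had to be hand-written in the BeautifulSoup parser become configuration rather than code.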

Cons of Scrapy

  • As it is a framework like Django, there is an initial learning curve, and it is not very beginner-friendly. There are specific ways to do things and particular places to put them.
  • However, there are good free tutorials available online to get started with Scrapy. I have mentioned a few of them at the end of this article.

How did I do this?


I have explained how I did it in a deep-dive video, which you can refer to.
Disclaimer: the video is in the Odia language.

Start URL

  • A root/parent/base URL, also called a start URL, is the starting point from which all further URLs are crawled.
  • There can be single or multiple start URLs.

Here is an example of the start URLs of the Sambad website.

from datetime import date, timedelta


def fetch_start_urls():
    start_date = date(2017, 9, 19)  # date from which the Sambad website started posting
    end_date = date.today()  # current date
    delta = timedelta(days=1)  # step through the archive pages one day at a time
    urls = []

    # Iterate through all the days from the start date to today
    while start_date <= end_date:
        urls.append(f"https://sambad.in/date/{start_date.strftime('%Y/%m/%d')}")
        start_date += delta
    return urls
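
To see what these start URLs look like, here is a small variant that takes the date range as parameters so that the output is reproducible. The function name daily_archive_urls and its parameters are hypothetical; only the URL pattern comes from the code above.

```python
from datetime import date, timedelta


def daily_archive_urls(first_day, last_day, base="https://sambad.in/date"):
    """Yield one daily archive URL per day from first_day to last_day, inclusive."""
    day = first_day
    while day <= last_day:
        # date objects accept strftime codes directly inside an f-string
        yield f"{base}/{day:%Y/%m/%d}"
        day += timedelta(days=1)


urls = list(daily_archive_urls(date(2017, 9, 19), date(2017, 9, 21)))
for url in urls:
    print(url)
# https://sambad.in/date/2017/09/19
# https://sambad.in/date/2017/09/20
# https://sambad.in/date/2017/09/21
```

Using a generator keeps memory use flat even when the range spans several years of daily archive pages.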

Broadly, there are two kinds of crawling, each driven by its own XPath expression:

  1. Horizontal Crawling
  2. Vertical Crawling

Horizontal Crawling

  • In horizontal crawling, we move across pages that sit at the same level of the website, such as the daily archive pages, e.g. the page for 31st December's posts and the page for 30th December's posts.
  • Each such page links to all the news posts made on that day.
  • Scrapy crawls through all these daily archives; this is called horizontal crawling.
  • Horizontal crawling also involves going through all the pages of a single day's archive by following the Next button.
Rule(LinkExtractor(restrict_xpaths="//a[@class='btn-bs-pagination next']"), follow=True)
  • The first rule handles horizontal crawling: it browses through all the pages of a day's archive by following the ପରବର୍ତ୍ତୀ ପୃଷ୍ଠା (next page) button.
  • You can check the archive page for 4th December as an example.
  • follow=True means the spider follows the extracted link and applies the same spider rules to the page it leads to.

Vertical crawling

  • In vertical crawling, we go deeper into the site: from the start URL into the daily archive pages, and from there into the individual article links.
  • We go deeper inside the pages, crawling through links inside links.
Rule(LinkExtractor(restrict_xpaths="//a[@class='post-title post-url']"), follow=True, callback='parse_item')
  • In this rule, we direct the spider to follow the URLs matched by the XPath expression and parse each resulting page with the callback function, parse_item.
  • This is vertical crawling from the daily archive page into the individual posts.
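
To make the two XPath expressions concrete, the sketch below applies them to a tiny, made-up archive page using Python's standard-library ElementTree instead of Scrapy. The markup and the hrefs are assumptions for illustration (the real Sambad pages are larger and may differ); only the two class values come from the rules above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified daily archive page (well-formed so ElementTree can parse it).
ARCHIVE_PAGE = """
<html><body>
  <h2><a class="post-title post-url" href="/article-1/">Article 1</a></h2>
  <h2><a class="post-title post-url" href="/article-2/">Article 2</a></h2>
  <a class="btn-bs-pagination next" href="/date/2021/12/04/page/2/">Next</a>
  <a class="menu-link" href="/about/">About</a>
</body></html>
""".strip()

root = ET.fromstring(ARCHIVE_PAGE)

# Vertical crawling: links that lead down into individual articles.
article_links = [a.get("href") for a in root.findall(".//a[@class='post-title post-url']")]

# Horizontal crawling: the link to the next page of the same day's archive.
next_links = [a.get("href") for a in root.findall(".//a[@class='btn-bs-pagination next']")]

print(article_links)  # ['/article-1/', '/article-2/']
print(next_links)     # ['/date/2021/12/04/page/2/']
```

Note that @class='...' is an exact string comparison, so a link whose class attribute has the same words in a different order, or an extra class, would not match.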

Extract fields from the page:

    def parse_item(self, response):
        l = ItemLoader(item=ScrapeprojectItem(), response=response)
        # Load fields using XPath expressions
        l.add_xpath('header', "//h1[@class='single-post-title']//text()",
                    MapCompose(lambda text: text.strip()), Join())
        l.add_xpath('content', "//div[@class='entry-content clearfix single-post-content']//text()",
                    MapCompose(lambda text: text.strip()), Join())
        return l.load_item()
  • There are two fields I have specifically extracted from the articles:

    1. Header
    2. Content
  • Two XPath expressions are provided, one for the header and another for the content.
  • MapCompose does the post-processing after the text has been extracted from the page.
  • Here MapCompose strips extra spaces and newline characters from each extracted text fragment, and Join then combines the fragments into a single string.
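
The effect of this strip-then-join step can be mimicked in plain Python. The sketch below is not Scrapy code: strip_and_join is a hypothetical stand-in for MapCompose(lambda text: text.strip()) followed by Join(), and the input fragments are made up. It also shows why a result can start or end with a space: whitespace-only fragments become empty strings, which still take part in the join.

```python
def strip_and_join(fragments, separator=" "):
    """Strip each text fragment, then join all of them with a separator."""
    stripped = [text.strip() for text in fragments]
    return separator.join(stripped)


def strip_and_join_nonempty(fragments, separator=" "):
    """Same as above, but drop fragments that are empty after stripping."""
    stripped = [text.strip() for text in fragments]
    return separator.join(text for text in stripped if text)


# Made-up fragments, similar to what //text() returns for a heading element.
fragments = ["\n  ", "Header text", "\n"]

print(repr(strip_and_join(fragments)))           # ' Header text '
print(repr(strip_and_join_nonempty(fragments)))  # 'Header text'
```

Dropping the empty fragments before joining is one possible refinement if the leading and trailing spaces are unwanted in the corpus.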

You can see the full spider written for Sambad website here:

from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from itemloaders.processors import MapCompose, Join  # scrapy.loader.processors in older Scrapy releases
from scrapy.spiders import CrawlSpider, Rule

from scrapeProject.items import ScrapeprojectItem
from datetime import date, timedelta


def fetch_start_urls():
    start_date = date(2017, 9, 19)
    end_date = date.today()
    delta = timedelta(days=1)
    urls = []
    while start_date <= end_date:
        urls += [f"https://sambad.in/date/{start_date.strftime('%Y/%m/%d')}"]
        start_date += delta
    return urls


class SambadSpider(CrawlSpider):
    name = 'sambad'
    start_urls = fetch_start_urls()

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//a[@class='btn-bs-pagination next']"), follow=True),
        Rule(LinkExtractor(restrict_xpaths="//a[@class='post-title post-url']"), follow=True, callback='parse_item'),
    )

    def parse_item(self, response):
        l = ItemLoader(item=ScrapeprojectItem(), response=response)
        # Load fields using XPath expressions
        l.add_xpath('header', "//h1[@class='single-post-title']//text()",
                    MapCompose(lambda text: text.strip()), Join())
        l.add_xpath('content', "//div[@class='entry-content clearfix single-post-content']//text()",
                    MapCompose(lambda text: text.strip()), Join())
        return l.load_item()

The full source code can be found here: https://bit.ly/OdiaNLPSambad

Data scraped

News websites and the corresponding article data extracted:

Website        # of Articles    Size
Sambad         2,38,680         947 MB
Dharitri       1,21,368         388.2 MB
Samaja         1,02,671         356.7 MB
Prameya        70,918           270.7 MB
Pragativadi    12,987           36.1 MB
Samaya         3,220            8.6 MB
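
As a quick sanity check on the 5,50,000 figure, the per-website article counts above can be summed with a few lines of Python. The helper parse_indian_number is a hypothetical name; the counts are copied from the table.

```python
# Article counts per website, written with Indian digit grouping as in the table.
ARTICLE_COUNTS = {
    "Sambad": "2,38,680",
    "Dharitri": "1,21,368",
    "Samaja": "1,02,671",
    "Prameya": "70,918",
    "Pragativadi": "12,987",
    "Samaya": "3,220",
}


def parse_indian_number(text):
    """Convert a comma-grouped number such as '2,38,680' to an int."""
    return int(text.replace(",", ""))


total_articles = sum(parse_indian_number(count) for count in ARTICLE_COUNTS.values())
sambad_share = parse_indian_number(ARTICLE_COUNTS["Sambad"]) / total_articles

print(total_articles)  # 549844
print(f"Sambad share: {sambad_share:.1%}")
```

The total comes to 5,49,844 articles, which is where the rounded figure of 5,50,000 comes from.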

Sample Article

{"header": [" ରଥଖଳାରେ ନିର୍ମାଣ ହେବ ରଥ; ଯାତ୍ରା ନିଷ୍ପତ୍ତି ନେବେ ରାଜ୍ୟସରକାର"], "content": [" Edited by Papu Mohanty  ଭୁବନେଶ୍ୱର, (ସବୁ୍ୟ): 07/05:ରଥ ନିର୍ମାଣ ନେଇ କେନ୍ଦ୍ରସରକାର ରାଜ୍ୟକୁ ଦେଇଛନ୍ତି ସବୁଜ ସଙ୍କେତ । ଏ ସମ୍ପର୍କରେ କେନ୍ଦ୍ର ଗୃହମନ୍ତ୍ରଣାଳୟ ପକ୍ଷରୁ ଓଡିଶା ମୁଖ୍ୟ ସଚିବଙ୍କୁ ଚିଠି ଲେଖି ଅନୁମତି ପ୍ରଦାନ ସମ୍ପର୍କରେ ଜଣାଇଦିଆଯାଇଛି । ତେବେ ଏହି ଅନୁମତି ସର୍ତମୂଳକ । ଗୃହମନ୍ତ୍ରଣାଳୟର ସର୍ତ୍ତ ଅନୁସାରେ ରଥଖଳାରେ ରଥ ନିର୍ମାଣ କାର୍ଯ୍ୟ କରାଯିବ ତେବେ ସେଠାରେ କୌଣସି ଧାର୍ମିକ ଏକତ୍ରୀ କରଣ ହୋଇପାରିବନି । ସାମାଜି ଦୂରତାର କଡା ଅନୁପାଳନ ହେବ । ଲକ୍‌ଡାଉନ୍‌ ସର୍ମ୍ପକିତ ସମସ୍ଥ ଗାଇଡ୍‌ଲାଇନକୁ କଡାକଡି ଭାବେ ପାଳନ କରାଯିବ । ସୂଚନାଯୋଗ୍ୟ ଯେ ରଥଯାତ୍ରା ପାଇଁ ରାଜ୍ୟ ସରକାର କେନ୍ଦ୍ରକୁ ଅନୁମତି ମାଗିଥିଲେ । ତେବେ ରଥ ନିର୍ମାଣ ପାଇଁ କେନ୍ଦ୍ର ସବୁଜ ସଙ୍କେତ ଦେଇଥିଲେ ମଧ୍ୟ ରଥଯାତ୍ରା ହେବକି ନାହିଁ ତାହାର ନିଷ୍ପତ୍ତି ରାଜ୍ୟସରକାରଙ୍କ ଉପରେ ଛାଡିଦେଇଛନ୍ତି । ରଥଯାତ୍ରା ହେବକିନାହିଁ ସେତେବେଳର ସ୍ଥିତିକୁ 
ଆକଳନ କରି ଯାହା ନିଷ୍ପତ୍ତି ରାଜ୍ୟସରକାର ନେବେବୋଲି କୁହାଯାଇଛି ।  "]}
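
Sentence and word counts like the ones quoted in the abstract can be estimated from records of this shape. The sketch below assumes one JSON object per line with a "content" list, splits sentences on the Odia danda (।) and words on whitespace; the function name corpus_stats and the short sample text are illustrative, and the counts reported in this post may have been computed differently.

```python
import json


def corpus_stats(json_lines):
    """Return (sentence_count, word_count) over an iterable of JSON-encoded records."""
    sentences = 0
    words = 0
    for line in json_lines:
        record = json.loads(line)
        text = " ".join(record["content"])
        # Split on the danda, then ignore pieces that are empty after stripping.
        parts = [part.strip() for part in text.split("।")]
        parts = [part for part in parts if part]
        sentences += len(parts)
        words += sum(len(part.split()) for part in parts)
    return sentences, words


# A tiny made-up record in the same shape as the scraped output.
sample_line = json.dumps(
    {
        "header": ["..."],
        "content": [" ତେବେ ଏହି ଅନୁମତି ସର୍ତମୂଳକ । ସାମାଜି ଦୂରତାର କଡା ଅନୁପାଳନ ହେବ ।  "],
    },
    ensure_ascii=False,
)

print(corpus_stats([sample_line]))  # (2, 9)
```

For a real run, the lines would come from the spider's output file rather than from an in-memory list.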

Future scope

  • Using the monolingual corpus for
    • Sentence auto-completion
    • Masked token prediction
    • Text classification
    • Question answering
  • Gaining freelancing experience

Scrapy Tutorials

Text-based

Video-based

Conclusion

In this article, I have described why I chose Scrapy as the scraping framework for extracting a monolingual corpus from Odia news websites, how vertical and horizontal crawling of a website can be done using XPath, and how specific fields can be fetched by crawling through a site's pages. This data can help further improve the computational linguistic capabilities of the Odia language.

Further References


If you find this article useful, please cite this using:

@misc{Soumendrak,
    author       = {Soumendra Kumar Sahoo},
    title        = {Odia Monolingual field based extraction},
    howpublished = {\url{https://www.blog.soumendrak.com/}},
    year         = {2021}
}
B

Thanks for the awesome post bhai. Can you explain this over a video? It will be really useful for newbies like me. Thanks a lot.

S

Sure, will try to do it, maybe on YouTube.
