Python web scraping cookbook: over 90 proven recipes to get you scraping with Python, microservices, Docker, and AWS
Untangle your web scraping complexities and access web data with ease using Python scripts. About This Book: Hands-on recipes for advancing your web scraping skills to expert level. One-stop solution guide to address complex and challenging web scraping tasks using Python. Understand the web page stru...
Other Authors: | |
---|---|
Format: | eBook |
Language: | English |
Published: | Birmingham, England ; Mumbai, [India] : Packt, 2018. |
Edition: | 1st edition |
Subjects: | |
See on Biblioteca Universitat Ramon Llull: | https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009631652706719 |
Table of Contents:
- Cover
- Copyright and Credits
- Contributors
- Packt Upsell
- Table of Contents
- Preface
- Chapter 1: Getting Started with Scraping
- Introduction
- Setting up a Python development environment
- Getting ready
- How to do it...
- Scraping Python.org with Requests and Beautiful Soup
- Getting ready...
- How to do it...
- How it works...
- Scraping Python.org in urllib3 and Beautiful Soup
- Getting ready...
- How to do it...
- How it works
- There's more...
- Scraping Python.org with Scrapy
- Getting ready...
- How to do it...
- How it works
- Scraping Python.org with Selenium and PhantomJS
- Getting ready
- How to do it...
- How it works
- There's more...
- Chapter 2: Data Acquisition and Extraction
- Introduction
- How to parse websites and navigate the DOM using BeautifulSoup
- Getting ready
- How to do it...
- How it works
- There's more...
- Searching the DOM with Beautiful Soup's find methods
- Getting ready
- How to do it...
- Querying the DOM with XPath and lxml
- Getting ready
- How to do it...
- How it works
- There's more...
- Querying data with XPath and CSS selectors
- Getting ready
- How to do it...
- How it works
- There's more...
- Using Scrapy selectors
- Getting ready
- How to do it...
- How it works
- There's more...
- Loading data in unicode / UTF-8
- Getting ready
- How to do it...
- How it works
- There's more...
- Chapter 3: Processing Data
- Introduction
- Working with CSV and JSON data
- Getting ready
- How to do it
- How it works
- There's more...
- Storing data using AWS S3
- Getting ready
- How to do it
- How it works
- There's more...
- Storing data using MySQL
- Getting ready
- How to do it
- How it works
- There's more...
- Storing data using PostgreSQL
- Getting ready
- How to do it
- How it works
- There's more...
- Storing data in Elasticsearch
- Getting ready
- How to do it
- How it works
- There's more...
- How to build robust ETL pipelines with AWS SQS
- Getting ready
- How to do it - posting messages to an AWS queue
- How it works
- How to do it - reading and processing messages
- How it works
- There's more...
- Chapter 4: Working with Images, Audio, and other Assets
- Introduction
- Downloading media content from the web
- Getting ready
- How to do it
- How it works
- There's more...
- Parsing a URL with urllib to get the filename
- Getting ready
- How to do it
- How it works
- There's more...
- Determining the type of content for a URL
- Getting ready
- How to do it
- How it works
- There's more...
- Determining the file extension from a content type
- Getting ready
- How to do it
- How it works
- There's more...
- Downloading and saving images to the local file system
- How to do it
- How it works
- There's more...
- Downloading and saving images to S3
- Getting ready
- How to do it
- How it works
- There's more...
- Generating thumbnails for images
- Getting ready
- How to do it
- How it works
- Taking a screenshot of a website
- Getting ready
- How to do it
- How it works
- Taking a screenshot of a website with an external service
- Getting ready
- How to do it
- How it works
- There's more...
- Performing OCR on an image with pytesseract
- Getting ready
- How to do it
- How it works
- There's more...
- Creating a Video Thumbnail
- Getting ready
- How to do it
- How it works
- There's more...
- Ripping an MP4 video to an MP3
- Getting ready
- How to do it
- There's more...
- Chapter 5: Scraping - Code of Conduct
- Introduction
- Scraping legality and scraping politely
- Getting ready
- How to do it
- Respecting robots.txt
- Getting ready
- How to do it
- How it works
- There's more...
- Crawling using the sitemap
- Getting ready
- How to do it
- How it works
- There's more...
- Crawling with delays
- Getting ready
- How to do it
- How it works
- There's more...
- Using identifiable user agents
- How to do it
- How it works
- There's more...
- Setting the number of concurrent requests per domain
- How it works
- Using auto throttling
- How to do it
- How it works
- There's more...
- Using an HTTP cache for development
- How to do it
- How it works
- There's more...
- Chapter 6: Scraping Challenges and Solutions
- Introduction
- Retrying failed page downloads
- How to do it
- How it works
- Supporting page redirects
- How to do it
- How it works
- Waiting for content to be available in Selenium
- How to do it
- How it works
- Limiting crawling to a single domain
- How to do it
- How it works
- Processing infinitely scrolling pages
- Getting ready
- How to do it
- How it works
- There's more...
- Controlling the depth of a crawl
- How to do it
- How it works
- Controlling the length of a crawl
- How to do it
- How it works
- Handling paginated websites
- Getting ready
- How to do it
- How it works
- There's more...
- Handling forms and forms-based authorization
- Getting ready
- How to do it
- How it works
- There's more...
- Handling basic authorization
- How to do it
- How it works
- There's more...
- Preventing bans by scraping via proxies
- Getting ready
- How to do it
- How it works
- Randomizing user agents
- How to do it
- Caching responses
- How to do it
- There's more...
- Chapter 7: Text Wrangling and Analysis
- Introduction
- Installing NLTK
- How to do it
- Performing sentence splitting
- How to do it
- There's more...
- Performing tokenization
- How to do it
- Performing stemming
- How to do it
- Performing lemmatization
- How to do it
- Determining and removing stop words
- How to do it
- There's more...
- Calculating the frequency distributions of words
- How to do it
- There's more...
- Identifying and removing rare words
- How to do it
- Removing punctuation marks
- How to do it
- There's more...
- Piecing together n-grams
- How to do it
- There's more...
- Scraping a job listing from StackOverflow
- Getting ready
- How to do it
- There's more...
- Reading and cleaning the description in the job listing
- Getting ready
- How to do it...
- Chapter 8: Searching, Mining and Visualizing Data
- Introduction
- Geocoding an IP address
- Getting ready
- How to do it
- How to collect IP addresses of Wikipedia edits
- Getting ready
- How to do it
- How it works
- There's more...
- Visualizing contributor location frequency on Wikipedia
- How to do it
- Creating a word cloud from a StackOverflow job listing
- Getting ready
- How to do it
- Crawling links on Wikipedia
- Getting ready
- How to do it
- How it works
- There's more...
- Visualizing page relationships on Wikipedia
- Getting ready
- How to do it
- How it works
- There's more...
- Calculating degrees of separation
- How to do it
- How it works
- There's more...
- Chapter 9: Creating a Simple Data API
- Introduction
- Creating a REST API with Flask-RESTful
- Getting ready
- How to do it
- How it works
- There's more...
- Integrating the REST API with scraping code
- Getting ready
- How to do it
- Adding an API to find the skills for a job listing
- Getting ready
- How to do it
- Storing data in Elasticsearch as the result of a scraping request
- Getting ready
- How to do it
- How it works
- There's more...
- Checking Elasticsearch for a listing before scraping
- How to do it
- There's more...
- Chapter 10: Creating Scraper Microservices with Docker
- Introduction
- Installing Docker
- Getting ready
- How to do it
- Installing a RabbitMQ container from Docker Hub
- Getting ready
- How to do it
- Running a Docker container (RabbitMQ)
- Getting ready
- How to do it
- There's more...
- Creating and running an Elasticsearch container
- How to do it
- Stopping/restarting a container and removing the image
- How to do it
- There's more...
- Creating a generic microservice with Nameko
- Getting ready
- How to do it
- How it works
- There's more...
- Creating a scraping microservice
- How to do it
- There's more...
- Creating a scraper container
- Getting ready
- How to do it
- How it works
- Creating an API container
- Getting ready
- How to do it
- There's more...
- Composing and running the scraper locally with docker-compose
- Getting ready
- How to do it
- There's more...
- Chapter 11: Making the Scraper as a Service Real
- Introduction
- Creating and configuring an Elastic Cloud trial account
- How to do it
- Accessing the Elastic Cloud cluster with curl
- How to do it
- Connecting to the Elastic Cloud cluster with Python
- Getting ready
- How to do it
- There's more...
- Performing an Elasticsearch query with the Python API
- Getting ready
- How to do it
- There's more...
- Using Elasticsearch to query for jobs with specific skills
- Getting ready
- How to do it
- Modifying the API to search for jobs by skill
- How to do it
- How it works
- There's more...
- Storing configuration in the environment
- How to do it
- Creating an AWS IAM user and a key pair for ECS
- Getting ready
- How to do it
- Configuring Docker to authenticate with ECR
- Getting ready
- How to do it
- Pushing containers into ECR
- Getting ready
- How to do it
- Creating an ECS cluster
- How to do it
- Creating a task to run our containers