Webbots, spiders, and screen scrapers: a guide to developing Internet agents with PHP/CURL

The Internet is bigger and better than what a mere browser allows. Webbots, Spiders, and Screen Scrapers is for programmers and businesspeople who want to take full advantage of the vast resources available on the Web. There's no reason to let browsers limit your online experience, especially wh...


Bibliographic Details
Main author:	Schrenk, Michael (-)
Format:	E-book
Language:	English
Published:	San Francisco : No Starch Press c2007.
Edition:	1st edition
Subjects:	Web search engines. Internet programming. Internet searching. Intelligent agents (Computer software)
View in Universitat Ramon Llull Library:	https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009627073806719

Table of Contents:

Intro
Acknowledgments
Table of Contents
Introduction
Old-School Client-Server Technology
The Problem with Browsers
What to Expect from This Book
Learn from My Mistakes
Master Webbot Techniques
Leverage Existing Scripts
About the Website
About the Code
Requirements
Hardware
Software
Internet Access
A Disclaimer (This Is Important)
PART I: Fundamental Concepts and Techniques
1: What's in It for You?
Uncovering the Internet's True Potential
What's in It for Developers?
Webbot Developers Are in Demand
Webbots Are Fun to Write
Webbots Facilitate "Constructive Hacking"
What's in It for Business Leaders?
Customize the Internet for Your Business
Capitalize on the Public's Inexperience with Webbots
Accomplish a Lot with a Small Investment
Final Thoughts
2: Ideas for Webbot Projects
Inspiration from Browser Limitations
Webbots That Aggregate and Filter Information for Relevance
Webbots That Interpret What They Find Online
Webbots That Act on Your Behalf
A Few Crazy Ideas to Get You Started
Help Out a Busy Executive
Save Money by Automating Tasks
Protect Intellectual Property
Monitor Opportunities
Verify Access Rights on a Website
Create an Online Clipping Service
Plot Unauthorized Wi-Fi Networks
Track Web Technologies
Allow Incompatible Systems to Communicate
Final Thoughts
3: Downloading Web Pages
Think About Files, Not Web Pages
Downloading Files with PHP's Built-in Functions
Downloading Files with fopen() and fgets()
Downloading Files with file()
Introducing PHP/CURL
Multiple Transfer Protocols
Form Submission
Basic Authentication
Cookies
Redirection
Agent Name Spoofing
Referer Management
Socket Management
Installing PHP/CURL
LIB_http
Familiarizing Yourself with the Default Values
Using LIB_http
Learning More About HTTP Headers
Examining LIB_http's Source Code
Final Thoughts
4: Parsing Techniques
Parsing Poorly Written HTML
Standard Parse Routines
Using LIB_parse
Splitting a String at a Delimiter: split_string()
Parsing Text Between Delimiters: return_between()
Parsing a Data Set into an Array: parse_array()
Parsing Attribute Values: get_attribute()
Removing Unwanted Text: remove()
Useful PHP Functions
Detecting Whether a String Is Within Another String
Replacing a Portion of a String with Another String
Parsing Unformatted Text
Measuring the Similarity of Strings
Final Thoughts
Don't Trust a Poorly Coded Web Page
Parse in Small Steps
Don't Render Parsed Text While Debugging
Use Regular Expressions Sparingly
5: Automating Form Submission
Reverse Engineering Form Interfaces
Form Handlers, Data Fields, Methods, and Event Triggers
Form Handlers
Data Fields
Methods
Event Triggers
Unpredictable Forms
JavaScript Can Change a Form Just Before Submission
Form HTML Is Often Unreadable by Humans
Cookies Aren't Included in the Form, but Can Affect Operation
Analyzing a Form
Final Thoughts
Don't Blow Your Cover
Correctly Emulate Browsers
Avoid Form Errors
6: Managing Large Amounts of Data
Organizing Data
Naming Conventions
Storing Data in Structured Files
Storing Text in a Database
Storing Images in a Database
Database or File?
Making Data Smaller
Storing References to Image Files
Compressing Data
Removing Formatting
Thumbnailing Images
Final Thoughts
PART II: Projects
7: Price-Monitoring Webbots
The Target
Designing the Parsing Script
Initialization and Downloading the Target
Further Exploration
8: Image-Capturing Webbots
Example Image-Capturing Webbot
Creating the Image-Capturing Webbot
Binary-Safe Download Routine
Directory Structure
The Main Script
Further Exploration
Final Thoughts
9: Link-Verification Webbots
Creating the Link-Verification Webbot
Initializing the Webbot and Downloading the Target
Setting the Page Base
Parsing the Links
Running a Verification Loop
Generating Fully Resolved URLs
Downloading the Linked Page
Displaying the Page Status
Running the Webbot
LIB_http_codes
LIB_resolve_addresses
Further Exploration
10: Anonymous Browsing Webbots
Anonymity with Proxies
Non-proxied Environments
Your Online Exposure
Proxied Environments
The Anonymizer Project
Writing the Anonymizer
Final Thoughts
11: Search-Ranking Webbots
Description of a Search Result Page
What the Search-Ranking Webbot Does
Running the Search-Ranking Webbot
How the Search-Ranking Webbot Works
The Search-Ranking Webbot Script
Initializing Variables
Starting the Loop
Fetching the Search Results
Parsing the Search Results
Final Thoughts
Be Kind to Your Sources
Search Sites May Treat Webbots Differently Than Browsers
Spidering Search Engines Is a Bad Idea
Familiarize Yourself with the Google API
Further Exploration
12: Aggregation Webbots
Choosing Data Sources for Webbots
Example Aggregation Webbot
Familiarizing Yourself with RSS Feeds
Writing the Aggregation Webbot
Adding Filtering to Your Aggregation Webbot
Further Exploration
13: FTP Webbots
Example FTP Webbot
PHP and FTP
Further Exploration
14: NNTP News Webbots
NNTP Use and History
Webbots and Newsgroups
Identifying News Servers
Identifying Newsgroups
Finding Articles in Newsgroups
Reading an Article from a Newsgroup
Further Exploration
15: Webbots That Read Email
The POP3 Protocol
Logging into a POP3 Mail Server
Reading Mail from a POP3 Mail Server
Executing POP3 Commands with a Webbot
Further Exploration
Email-Controlled Webbots
Email Interfaces
16: Webbots That Send Email
Email, Webbots, and Spam
Sending Mail with SMTP and PHP
Configuring PHP to Send Mail
Sending an Email with mail()
Writing a Webbot That Sends Email Notifications
Keeping Legitimate Mail out of Spam Filters
Sending HTML-Formatted Email
Further Exploration
Using Returned Emails to Prune Access Lists
Using Email as Notification That Your Webbot Ran
Leveraging Wireless Technologies
Writing Webbots That Send Text Messages
17: Converting a Website into a Function
Writing a Function Interface
Defining the Interface
Analyzing the Target Web Page
Using describe_zipcode()
Final Thoughts
Distributing Resources
Using Standard Interfaces
Designing a Custom Lightweight "Web Service"
PART III: Advanced Technical Considerations
18: Spiders
How Spiders Work
Example Spider
LIB_simple_spider
harvest_links()
archive_links()
get_domain()
exclude_link()
Experimenting with the Spider
Adding the Payload
Further Exploration
Save Links in a Database
Separate the Harvest and Payload
Distribute Tasks Across Multiple Computers
Regulate Page Requests
19: Procurement Webbots and Snipers
Procurement Webbot Theory
Get Purchase Criteria
Authenticate Buyer
Verify Item
Evaluate Purchase Triggers
Make Purchase
Evaluate Results
Sniper Theory
Get Purchase Criteria
Authenticate Buyer
Verify Item
Synchronize Clocks
Time to Bid?
Submit Bid
Evaluate Results
Testing Your Own Webbots and Snipers
Further Exploration
Final Thoughts
20: Webbots and Cryptography
Designing Webbots That Use Encryption
SSL and PHP Built-in Functions
Encryption and PHP/CURL
A Quick Overview of Web Encryption
Local Certificates
Final Thoughts
21: Authentication
What Is Authentication?
Types of Online Authentication
Strengthening Authentication by Combining Techniques
Authentication and Webbots
Example Scripts and Practice Pages
Basic Authentication
Session Authentication
Authentication with Cookie Sessions
Authentication with Query Sessions
Final Thoughts
22: Advanced Cookie Management
How Cookies Work
PHP/CURL and Cookies
How Cookies Challenge Webbot Design
Purging Temporary Cookies
Managing Multiple Users' Cookies
Further Exploration
23: Scheduling Webbots and Spiders
The Windows Task Scheduler
Preparing Your Webbots to Run as Scheduled Tasks
Scheduling a Webbot to Run Daily
Complex Schedules
Non-Calendar-Based Triggers
Final Thoughts
Determine the Webbot's Best Periodicity
Avoid Single Points of Failure
Add Variety to Your Schedule
PART IV: Larger Considerations
24: Designing Stealthy Webbots and Spiders
Why Design a Stealthy Webbot?
Log Files
Log-Monitoring Software
Stealth Means Simulating Human Patterns
Be Kind to Your Resources
Run Your Webbot During Busy Hours
Don't Run Your Webbot at the Same Time Each Day
Don't Run Your Webbot on Holidays and Weekends
Use Random, Intra-fetch Delays
Final Thoughts
25: Writing Fault-Tolerant Webbots
Types of Webbot Fault Tolerance
Adapting to Changes in URLs
Adapting to Changes in Page Content
Adapting to Changes in Forms
Adapting to Changes in Cookie Management
Adapting to Network Outages and Network Congestion
Error Handlers
26: Designing Webbot-Friendly Websites
Optimizing Web Pages for Search Engine Spiders
