From Raw HTML to Clean Data: Structured Extraction Techniques
The raw HTML of a webpage is messy. Extracting clean, structured data requires understanding selectors, handling edge cases, and normalizing data into consistent formats.
Selector Strategies
Two main approaches exist for locating data in HTML: CSS selectors and XPath. Each has strengths depending on the situation.
CSS Selectors
CSS selectors are intuitive for anyone familiar with web development. They're the default choice for most scraping tasks.
// Select by class
document.querySelector('.product-title')
// Select by attribute
document.querySelector('[data-product-id]')
// Nested selection
document.querySelector('.product-card .price')
XPath
XPath is more powerful for complex selections, especially when you need to navigate up the DOM tree or select by text content.
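In the browser, XPath expressions are evaluated from JavaScript with the standard document.evaluate call. The sketch below wraps that call in a helper named xpathFirst, which is a hypothetical name chosen for illustration. It takes the document as a parameter so it can be exercised outside a browser, and it uses the numeric value 9, which the DOM exposes as XPathResult.FIRST_ORDERED_NODE_TYPE.

```javascript
// Numeric value of XPathResult.FIRST_ORDERED_NODE_TYPE in the DOM XPath API.
// Hard-coded here only so the helper does not depend on a browser global.
const FIRST_ORDERED_NODE_TYPE = 9

// Hypothetical helper: return the first node matching an XPath expression,
// or null when nothing matches.
function xpathFirst(doc, expression, contextNode = doc) {
  const result = doc.evaluate(
    expression,
    contextNode,             // node the expression is evaluated against
    null,                    // namespace resolver, not needed for plain HTML
    FIRST_ORDERED_NODE_TYPE,
    null                     // no existing result object to reuse
  )
  return result.singleNodeValue
}
```

In a page, a call might look like xpathFirst(document, "//span[@class='price']/parent::div"). Because a miss comes back as null, the caller still has to check the result before reading from it.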
// Select by text content
//div[contains(text(), 'Price:')]
// Navigate to parent
//span[@class='price']/parent::div
// Select nth element
(//div[@class='product'])[3]
Handling Edge Cases
Real-world data is messy. Your extraction code needs to handle:
- Missing elements - querySelector returns null when nothing matches, so check that an element exists before reading from it
- Multiple formats - Prices might be "$100" or "100 USD"
- Whitespace - Extra spaces, newlines, and tabs
- HTML entities - &nbsp;, &amp;, and other encoded characters
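The cases above can be sketched with two small helpers. Both names, textOrNull and parseMoney, are hypothetical, and the symbol table is an illustrative assumption rather than a complete list (a bare "$" is treated as USD, which will be wrong for some sites). The root argument is anything with a querySelector method, such as document or an element.

```javascript
// Hypothetical helper: read an element's text, or null when nothing matches.
function textOrNull(root, selector) {
  const el = root.querySelector(selector)
  if (el === null) return null
  // textContent is already entity-decoded, so &nbsp; arrives as U+00A0,
  // which \s matches along with spaces, tabs, and newlines.
  return el.textContent.replace(/\s+/g, ' ').trim()
}

// Assumed mapping from currency symbol to ISO code, for illustration only.
const SYMBOL_TO_CODE = { '$': 'USD', '€': 'EUR', '£': 'GBP' }

// Hypothetical helper: accept "$100" or "100 USD" and return one shape.
function parseMoney(raw) {
  const text = raw.trim()
  const symbolFirst = text.match(/^([$€£])\s*(\d+(?:\.\d+)?)$/)
  if (symbolFirst) {
    return { amount: Number(symbolFirst[2]), currency: SYMBOL_TO_CODE[symbolFirst[1]] }
  }
  const codeLast = text.match(/^(\d+(?:\.\d+)?)\s*([A-Z]{3})$/)
  if (codeLast) {
    return { amount: Number(codeLast[1]), currency: codeLast[2] }
  }
  return null // unrecognized format: leave the decision to the caller
}
```

Returning null for both a missing element and an unrecognized price keeps the failure explicit, so later steps can log it instead of silently storing a bad value.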
Data Normalization
Once extracted, data needs to be normalized for consistency:
Text Cleaning
function cleanText(text) {
  return text
    .trim()
    // Collapse each run of spaces, tabs, and newlines into a single space
    .replace(/\s+/g, ' ')
}
Price Parsing
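function parsePrice(priceStr) {
  // Keep digits and separators, dropping currency symbols and codes
  const cleaned = priceStr.replace(/[^0-9.,]/g, '')
  // Remove every comma, treating it as a thousands separator (US-style input)
  return parseFloat(cleaned.replace(/,/g, ''))
}
Date Standardization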
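function standardizeDate(dateStr) {
  const parsed = new Date(dateStr)
  // Unparseable input gives an Invalid Date, and toISOString() would throw on it
  if (Number.isNaN(parsed.getTime())) return null
  // toISOString() reports UTC, so a local-time input can shift by one day
  return parsed.toISOString().split('T')[0]
}
Building Robust Extractors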
A robust extractor combines multiple strategies:
- Try the primary selector
- Fall back to alternative selectors
- Clean and normalize the data
- Validate the result
- Log warnings for unexpected patterns
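The steps above can be sketched as a single function. This is a minimal illustration under stated assumptions, not a definitive implementation: extractField and its options (clean, validate, warn) are hypothetical names, and root is anything with a querySelector method.

```javascript
// Hypothetical extractor: tries each selector in order and returns the first
// value that survives cleaning and validation, or null if none does.
function extractField(root, selectors, { clean, validate, warn = console.warn }) {
  for (let i = 0; i < selectors.length; i++) {
    const el = root.querySelector(selectors[i])
    if (el === null) continue             // no match: move on to the next selector

    const value = clean(el.textContent)   // clean and normalize
    if (!validate(value)) {               // validate the result
      warn(`Selector "${selectors[i]}" matched but produced an invalid value`)
      continue
    }
    if (i > 0) {
      // A fallback hit often means the page layout has changed
      warn(`Primary selector failed; used fallback "${selectors[i]}"`)
    }
    return value
  }
  warn(`No selector produced a valid value: ${selectors.join(', ')}`)
  return null
}
```

In a browser, a call might look like extractField(document, ['.product-card .price', '[data-price]'], { clean: parsePrice, validate: Number.isFinite }), where the second selector is an assumed fallback. Passing warn as an option makes it easy to route warnings to a log collector instead of the console.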
Tools for Testing
Before running your scraper at scale, test your selectors:
- Browser DevTools - Test selectors in the Console
- JSON Vision - Our free extension for inspecting API responses
- Scrapy Shell - Interactive testing for Python scrapers
Clean data extraction is part art, part science. With practice, you'll develop intuition for building extractors that handle real-world messiness gracefully.