Data Extraction
Welcome to Monitoro Herd’s powerful data extraction system! This guide will walk you through how to extract structured data from web pages using our intuitive selector system and transformation pipelines.
Understanding Selectors
Herd provides a flexible and powerful way to extract data from web pages using a declarative JSON-based selector system.
Basic Extraction
The simplest form of extraction uses CSS selectors to target elements:
// Extract basic text content
const data = await page.extract({
  title: 'h1', // Extracts the main heading
  description: 'p', // Extracts the first paragraph
  link: 'a' // Extracts the first link text
});
console.log(data.title); // "Welcome to Our Website"
console.log(data.description); // "This is our homepage."
Advanced Selector Syntax
For more complex extraction needs, use the expanded object syntax:
const data = await page.extract({
  title: {
    _$: 'h1', // CSS selector
    attribute: 'id' // Extract the ID attribute instead of text
  },
  price: {
    _$: '.price', // Target price element
    pipes: ['parseNumber'] // Apply transformation
  }
});
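When many fields need the same cleanup, a schema can be expanded programmatically before it is passed to page.extract. The sketch below assumes that a shorthand string such as 'h1' is equivalent to the expanded form { _$: 'h1' }, which the examples above imply but do not state. The helper name withPipes is hypothetical and not part of Herd.

```javascript
// Hypothetical helper, not part of Herd.
// Converts shorthand string selectors into the expanded object form and
// attaches the given pipes. Entries that are already objects, and the
// special _$ / _$r keys, are left untouched.
function withPipes(schema, pipes) {
  const expanded = {};
  for (const [key, value] of Object.entries(schema)) {
    const isShorthand = typeof value === 'string' && !key.startsWith('_$');
    expanded[key] = isShorthand ? { _$: value, pipes: [...pipes] } : value;
  }
  return expanded;
}

const schema = withPipes(
  { title: 'h1', price: { _$: '.price', pipes: ['parseNumber'] } },
  ['trim']
);
console.log(JSON.stringify(schema, null, 2));
```

The resulting object can then be passed to page.extract in place of the hand-written schema.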
Extracting Lists of Items
To extract multiple elements that match a pattern, use the _$r (repeat) selector:
const data = await page.extract({
  items: {
    _$r: '.item', // Find all elements with class "item"
    title: 'h2', // For each item, get the title
    price: '.price', // For each item, get the price
    date: 'time' // For each item, get the date
  }
});
// Access the extracted items
data.items.forEach(item => {
  console.log(`${item.title}: ${item.price}, Posted: ${item.date}`);
});
Nested Extraction
You can nest selectors to extract hierarchical data:
const data = await page.extract({
  product: {
    name: '.product-name',
    details: {
      _$: '.product-details',
      specs: {
        _$r: '.spec-item',
        label: '.spec-label',
        value: '.spec-value'
      }
    }
  }
});
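A list of label/value pairs is often easier to use as a lookup object. The sketch below assumes the result mirrors the schema, with specs returned as an array of { label, value } objects. The sample data is hypothetical and stands in for a real extraction result.

```javascript
// Hypothetical sample in the shape the nested schema above is assumed to produce.
const data = {
  product: {
    name: 'Trail Backpack',
    details: {
      specs: [
        { label: 'Weight', value: '1.2 kg' },
        { label: 'Volume', value: '30 L' }
      ]
    }
  }
};

// Turn the list of { label, value } pairs into a lookup object keyed by label.
const specs = Object.fromEntries(
  data.product.details.specs.map(({ label, value }) => [label, value])
);

console.log(specs.Weight);
```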
Special Selectors
Herd provides special selectors to handle various extraction scenarios:
Root Selector (:root)
The :root selector refers to the current element in context:
const data = await page.extract({
  items: {
    _$r: '.item',
    someElement: ':root', // Extract text of the .item element itself
    classes: {
      _$: ':root',
      attribute: 'class' // Extract class attribute of the same element
    }
  }
});
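The class attribute arrives as a single space-separated string, so it usually needs splitting before individual class names can be checked. A minimal sketch, using a hypothetical item in the result shape the schema above is assumed to produce:

```javascript
// Hypothetical item; the values are made up for illustration.
const item = { someElement: 'Blue widget', classes: 'item  featured on-sale' };

// Split on any run of whitespace and drop empty entries.
const classList = item.classes.trim().split(/\s+/).filter(Boolean);
const isFeatured = classList.includes('featured');

console.log(classList, isFeatured);
```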
Property Extraction
You can extract JavaScript properties from elements:
const data = await page.extract({
  dimensions: {
    _$: '.box',
    property: 'getBoundingClientRect' // Get element dimensions
  },
  html: {
    _$: '.content',
    property: 'innerHTML' // Get inner HTML
  }
});
Transformation Pipelines
Herd includes powerful transformation pipelines to process extracted data:
Available Transformations
Pipe | Description | Example Input | Example Output |
---|---|---|---|
trim | Removes whitespace from start/end | " Hello " | "Hello" |
toLowerCase | Converts text to lowercase | "HELLO" | "hello" |
toUpperCase | Converts text to uppercase | "hello" | "HELLO" |
parseNumber | Extracts numbers from text | "$1,2K.45" | 1200.45 |
parseDate | Converts text to date | "2024-01-15" | "2024-01-15T00:00:00.000Z" |
parseDateTime | Converts text to datetime | "2024-01-15T12:00:00Z" | "2024-01-15T12:00:00.000Z" |
Using Transformations
Apply transformations using the pipes property:
const data = await page.extract({
  price: {
    _$: '.price',
    pipes: ['parseNumber'] // Convert "$1,234.56" to 1234.56
  },
  title: {
    _$: 'h1',
    pipes: ['trim', 'toLowerCase'] // Apply multiple transformations
  }
});
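To make the idea of chaining concrete, here is a standalone sketch of how a list of pipe names could be applied to a value one after another. It is not Herd's implementation. It assumes pipes run left to right, each receiving the previous pipe's output, and the names pipeTable and applyPipes are hypothetical.

```javascript
// Illustrative stand-ins for three of the string pipes listed in the table.
const pipeTable = {
  trim: (text) => text.trim(),
  toLowerCase: (text) => text.toLowerCase(),
  toUpperCase: (text) => text.toUpperCase()
};

// Apply each named pipe in order, feeding the result into the next one.
function applyPipes(value, pipes) {
  return pipes.reduce((current, name) => {
    const pipe = pipeTable[name];
    if (!pipe) throw new Error(`Unknown pipe: ${name}`);
    return pipe(current);
  }, value);
}

const title = applyPipes('  Welcome to Our Website  ', ['trim', 'toLowerCase']);
console.log(title); // "welcome to our website"
```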
Handling Currency and Large Numbers
The parseNumber transformation handles various formats:
const data = await page.extract({
  price1: {
    _$: '.price-1', // Contains "$1,234.56"
    pipes: ['parseNumber'] // Result: 1234.56
  },
  price2: {
    _$: '.price-2', // Contains "$1.5M"
    pipes: ['parseNumber'] // Result: 1500000
  },
  price3: {
    _$: '.price-3', // Contains "1.5T€"
    pipes: ['parseNumber'] // Result: 1500000000000
  }
});
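For intuition about what such a transformation has to do, the sketch below reproduces the three results above in plain JavaScript. It is an approximation written for this guide, not Herd's actual parseNumber. The function name parseNumberSketch, the K/M/B/T suffix table, and the comma-as-thousands-separator rule are all assumptions.

```javascript
// Assumed magnitude suffixes; the real pipe may support a different set.
const MULTIPLIERS = { K: 1e3, M: 1e6, B: 1e9, T: 1e12 };

function parseNumberSketch(text) {
  // Treat commas as thousands separators, then find the first number and an
  // optional suffix that is not the start of a longer word (so "12 kg" stays 12).
  const match = text
    .replace(/,/g, '')
    .match(/(-?\d+(?:\.\d+)?)(?:\s*([KMBT])(?![a-z]))?/i);
  if (!match) return null; // No digits found
  const value = parseFloat(match[1]);
  const suffix = match[2] ? match[2].toUpperCase() : null;
  return suffix ? value * MULTIPLIERS[suffix] : value;
}

console.log(parseNumberSketch('$1,234.56')); // 1234.56
console.log(parseNumberSketch('$1.5M'));     // 1500000
console.log(parseNumberSketch('1.5T€'));     // 1500000000000
```

Locales that use a comma as the decimal separator would need different handling; this sketch does not attempt it.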
Real-World Examples
Let’s look at some practical examples of data extraction:
E-commerce Product Listing
Extract products from a search results page:
const searchResults = await page.extract({
  products: {
    _$r: '[data-component-type="s-search-result"]',
    title: {
      _$: 'h2 .a-link-normal',
      pipes: ['trim']
    },
    price: {
      _$: '.a-price .a-offscreen',
      pipes: ['parseNumber']
    },
    rating: {
      _$: '.a-icon-star-small .a-icon-alt',
      pipes: ['trim']
    },
    reviews: {
      _$: '.a-size-base.s-underline-text',
      pipes: ['trim']
    }
  }
});
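In the schema above, rating and reviews are only trimmed, so they come back as text. If numeric values are needed, they can be derived after extraction. The sketch below assumes the rating text looks like "4.5 out of 5 stars" and the review count like "1,284". Both formats, the sample product, and the helper name firstNumber are hypothetical.

```javascript
// Hypothetical product in the assumed result shape.
const product = {
  title: 'USB-C Cable',
  price: 9.99,
  rating: '4.5 out of 5 stars',
  reviews: '1,284'
};

// Return the first number found in the text, or null if there is none.
function firstNumber(text) {
  const match = /\d+(?:\.\d+)?/.exec((text ?? '').replace(/,/g, ''));
  return match ? parseFloat(match[0]) : null;
}

const cleaned = {
  ...product,
  rating: firstNumber(product.rating),
  reviews: firstNumber(product.reviews)
};

console.log(cleaned.rating, cleaned.reviews);
```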
News Article List
Extract articles from a news site:
const articles = await page.extract({
  items: {
    _$r: '.item',
    title: {
      _$: 'h2',
      pipes: ['trim', 'toLowerCase']
    },
    price: {
      _$: '.price',
      pipes: ['parseNumber']
    },
    date: {
      _$: 'time',
      pipes: ['parseDate']
    }
  }
});
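Once dates have been normalized, the extracted items can be ordered by recency. The sketch below assumes parseDate yields ISO 8601 strings, as in the transformation table's example output, and that an item without a date carries null. The sample items and the comparator name byNewest are hypothetical.

```javascript
// Hypothetical items in the assumed result shape.
const items = [
  { title: 'budget approved', date: '2024-01-15T00:00:00.000Z' },
  { title: 'storm warning', date: '2024-03-02T00:00:00.000Z' },
  { title: 'no date yet', date: null }
];

// Newest first; items without a usable date go last.
function byNewest(a, b) {
  const timeA = Date.parse(a.date ?? '');
  const timeB = Date.parse(b.date ?? '');
  if (Number.isNaN(timeA)) return Number.isNaN(timeB) ? 0 : 1;
  if (Number.isNaN(timeB)) return -1;
  return timeB - timeA;
}

// Copy before sorting so the original extraction result is left unchanged.
const sorted = [...items].sort(byNewest);
console.log(sorted.map((item) => item.title));
```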
Advanced Techniques
Handling Dynamic Content
For dynamic content that loads after the page is ready:
// Wait for dynamic content to load
await page.waitForElement('#dynamic span');
// Then extract the content
const data = await page.extract({
  content: '#dynamic span'
});
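If the element never appears, the wait could stall the whole run. This guide does not say whether waitForElement has a built-in timeout, so the sketch below wraps it in a generic one. The helpers withTimeout and extractWhenReady are hypothetical, and fakePage is a stand-in object so the example runs without a browser. Only page.waitForElement and page.extract come from the examples above.

```javascript
// Hypothetical helper, not part of Herd: rejects if `promise` takes longer than `ms`.
function withTimeout(promise, ms, label) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms} ms`)), ms);
  });
  // Clear the timer either way so it does not keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Wait for the selector (with a time limit), then run the extraction.
async function extractWhenReady(page, selector, schema, ms = 10000) {
  await withTimeout(page.waitForElement(selector), ms, `waitForElement(${selector})`);
  return page.extract(schema);
}

// Stand-in page object that records the order of calls.
const fakePage = {
  calls: [],
  async waitForElement(selector) { this.calls.push(`wait:${selector}`); },
  async extract() { this.calls.push('extract'); return { content: 'Loaded' }; }
};

const pending = extractWhenReady(fakePage, '#dynamic span', { content: '#dynamic span' });
pending.then((data) => console.log(data.content));
```

With a real page object, the call would look the same, with fakePage replaced by page.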
Extracting Page Metadata
Extract information about the page itself:
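const pageInfo = await page.extract({
  title: 'title',
  metaDescription: {
    _$: 'meta[name="description"]',
    attribute: 'content' // A <meta> element has no text; the description is in its content attribute
  },
  canonicalUrl: {
    _$: 'link[rel="canonical"]',
    attribute: 'href'
  }
});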
Tips for Effective Extraction
- Use Specific Selectors: The more specific your CSS selectors, the more reliable your extraction
- Test Incrementally: Build your extraction schema step by step, testing each part
- Handle Missing Data: Always account for elements that might not exist on the page
- Apply Appropriate Transformations: Use pipes to clean and format data as needed
- Combine with Interactions: For complex sites, interact with the page before extraction
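The "Handle Missing Data" tip can be applied after extraction by checking each record before it is used. This guide does not say how a missing element is represented in the result, so the sketch below treats undefined, null, and the empty string as missing. The helper name missingFields and the sample items are hypothetical.

```javascript
// Return the names of required fields that are absent or empty in a record.
function missingFields(record, requiredKeys) {
  return requiredKeys.filter((key) => {
    const value = record[key];
    return value === undefined || value === null || value === '';
  });
}

// Hypothetical extraction result with two incomplete items.
const items = [
  { title: 'Desk lamp', price: 24.5 },
  { title: '', price: 12 },
  { title: 'Bookshelf', price: null }
];

// Keep only the items that have every required field.
const complete = items.filter(
  (item) => missingFields(item, ['title', 'price']).length === 0
);

console.log(complete.length, missingFields(items[2], ['title', 'price']));
```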
Next Steps
Now that you understand Herd’s data extraction system, you can:
- Create complex extraction schemas for any website
- Transform raw data into structured, usable formats
- Build powerful automations that collect and process web data