Batch crawling is a practical way to pull the same kind of information from many web pages—product listings, policy pages, store locations, documentation pages—and bring it into a single, analyzable table. With Power BI, you can do this without standing up a separate scraping service: Power Query can iterate through a list of URLs, request each page, and extract the fields you care about, then Power BI can refresh the dataset on a schedule. The result is a lightweight “web crawler” pattern that’s especially useful for recurring reporting, monitoring changes, and consolidating semi-structured web content into a model you can visualize.
Build a Batch Web Crawler in Power Query for BI
In Power BI Desktop (Power Query), the core pattern is to start with a table of targets (URLs) and then apply a custom function that fetches and parses each page. Typically, you create a query such as Pages that contains one row per URL (from an Excel sheet, SharePoint list, manual entry, or even URLs generated from an ID list). Then you author a function (for example, fnFetchPage) that takes a url parameter, calls Web.Contents(url), and transforms the response into structured columns—often by using Html.Table with CSS selector mappings to extract headings, tables, or specific elements.

After the function returns a record or table for a single page, you invoke it across the URL list (Add Column → Invoke Custom Function), expand the results into columns, and normalize the output (data types, missing values, deduplication). The "batch" aspect comes from letting Power Query loop over many rows, while the "crawler" aspect comes from consistent extraction logic that turns each HTML page into the same schema—plus optional safeguards such as try otherwise, so that one failed page doesn't break the entire refresh, and metadata columns such as crawl timestamp, HTTP status (where possible), or source URL to aid troubleshooting.
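To make this concrete, below is a minimal sketch of both pieces in Power Query M. It is illustrative rather than definitive: the Title and Price fields, the "h1" and "span.price" CSS selectors, and the example.com URLs are hypothetical placeholders that you would replace with the markup and addresses of your own target pages.

    // fnFetchPage: fetch one page and extract a fixed set of fields as a record.
    // The field names and selectors below are placeholders; adjust them to the page markup.
    (url as text) as record =>
    let
        Response = Web.Contents(url),
        Parsed   = Html.Table(
                       Response,
                       {{"Title", "h1"}, {"Price", "span.price"}},
                       [RowSelector = ":root"]),
        Result   = if Table.IsEmpty(Parsed)
                   then [Title = null, Price = null]   // no matching elements on this page
                   else Table.First(Parsed)
    in
        Result

    // Batch query: one row per URL, then invoke fnFetchPage for each row.
    let
        Pages    = #table(type table [Url = text],
                       {{"https://www.example.com/items/1"},
                        {"https://www.example.com/items/2"}}),
        // try ... otherwise substitutes a placeholder record so one failed page
        // does not abort the entire refresh.
        Fetched  = Table.AddColumn(Pages, "Page",
                       each try fnFetchPage([Url]) otherwise [Title = null, Price = null]),
        // Metadata column for troubleshooting: when this row was crawled.
        Stamped  = Table.AddColumn(Fetched, "CrawledAt",
                       each DateTimeZone.UtcNow(), type datetimezone),
        // Expand the per-page record into ordinary columns and set data types.
        Expanded = Table.ExpandRecordColumn(Stamped, "Page", {"Title", "Price"}),
        Typed    = Table.TransformColumnTypes(Expanded,
                       {{"Title", type text}, {"Price", type text}})
    in
        Typed

In the Power Query editor, the same invocation can be produced through Add Column → Invoke Custom Function; the hand-written Table.AddColumn step above is roughly what that UI action generates, here wrapped in try ... otherwise.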
Schedule Crawls and Refresh Web Data in Power BI
Once your batch-crawl query works in Desktop, publishing it to the Power BI Service turns it into a refreshable pipeline—effectively scheduled crawling. After publishing, you configure dataset credentials and privacy levels so that the service can call the web sources, and then set up Scheduled Refresh (or use on-demand refresh) based on how often the web content changes and how much load you want to place on the target site. For stability, it helps to reduce the number of requests per refresh (filter to changed pages when possible), keep the load on the target site modest (avoid unnecessarily large pages and repeated calls for the same content), and keep parsing resilient so minor HTML changes don't break the whole dataset. If your environment requires a gateway, confirm it supports the connectivity you need; many pure web requests, however, can run cloud-to-web directly in the service. In practice, this setup lets you monitor web updates over time: each refresh re-fetches the pages, your model captures the latest values, and your reports can highlight changes, exceptions, or trends—turning what would otherwise be manual copy/paste into a governed, repeatable BI process.
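One wrinkle worth planning for: the service validates a dataset's data sources before it allows scheduled refresh, and Web.Contents calls whose entire URL is assembled dynamically at runtime can fail that validation. A common workaround is to keep the base URL static and move the variable parts of the address into the RelativePath and Query options. A minimal sketch, assuming a hypothetical example.com endpoint:

    // Refresh-friendly request: static host, dynamic pieces passed as options.
    // The host, path segment, and query parameter are placeholders for your real target.
    (itemId as text) as binary =>
    Web.Contents(
        "https://www.example.com",              // static base URL the service can validate
        [
            RelativePath = "items/" & itemId,   // dynamic path segment
            Query        = [lang = "en"],       // optional query-string parameters
            ManualStatusHandling = {404}        // treat 404 as data rather than a refresh failure
        ]
    )

When ManualStatusHandling is used, the HTTP status can be read afterwards via Value.Metadata on the response (the Response.Status field), which is one way to populate the HTTP status metadata column mentioned earlier.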
Batch crawling web page data with Power BI is really a combination of two strengths: Power Query’s ability to iterate and transform semi-structured content, and the Power BI Service’s ability to refresh on a schedule. By building a URL-driven extraction function, applying it across a list, and publishing with a controlled refresh cadence, you can keep web-derived datasets current and report-ready—without maintaining separate scraping infrastructure. The key is to keep the extraction logic consistent, handle errors gracefully, and refresh responsibly so your “crawler” is both reliable and respectful of the sites it reads.
