Vangelis Katsikaros

Dataharvest 2024

I give you a URL and you give me back LEGO bricks - OR - How to use an LLM to write your first scraper

Workshop at Dataharvest 2024, 2024/Jun/01

Participants will write a simple web scraper in Python. A Large Language Model (LLM) will be used as a coding assistant to write the scraper’s skeleton and help participants overcome obstacles. Participants will also learn that LLMs can’t be trusted blindly: judgment and domain expertise are needed to navigate the LLM’s answers. After this session participants will be able to implement a very basic web scraper in Python, use an LLM as a coding assistant, and understand its limitations.

No coding skills are required, but participants should be comfortable copy-pasting Python code even if they don’t fully understand what it does. A very basic (even abstract) understanding of what a web scraper does will help participants follow the workshop. We will use Google Colab (a web tool) to write and run Python; no additional software is needed on your computer.

Part 1

We briefly discussed LLMs: why they are interesting, what they are good at, what they are bad at, and why they make an interesting teaching assistant for programming.

Part 2

Set up your development environment

If you want to avoid installing python locally:

Or install python locally

Part 3

In Chrome, open the page we want to scrape in a separate tab. It’s a list of LEGO bricks. The page is static and runs no JavaScript.
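Because the page is static, everything we want is already in the HTML the server sends back, so a scraper only has to parse that HTML. Here is a minimal sketch of the idea, not the scraper built in the workshop. It assumes the bricks sit in a plain HTML table; the sample markup, the column names and the `TableRowParser` and `parse_rows` helpers are all made up for illustration, and the real page’s structure will differ. It uses only the Python standard library and parses an inline string so it runs without network access.

```python
from html.parser import HTMLParser

# Hypothetical sample markup standing in for the real page (an assumption:
# the actual LEGO list may not be a table and will have different columns).
SAMPLE_HTML = """
<html><body>
<table>
  <tr><th>Part</th><th>Name</th></tr>
  <tr><td>3001</td><td>Brick 2 x 4</td></tr>
  <tr><td>3003</td><td>Brick 2 x 2</td></tr>
</table>
</body></html>
"""


class TableRowParser(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, each a list of cell strings
        self._row = None        # cells of the row currently being read
        self._in_cell = False   # True while inside a <td> or <th>
        self._cell_text = []    # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._cell_text = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._in_cell:
            self._cell_text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._in_cell:
            self._row.append("".join(self._cell_text).strip())
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_rows(html):
    """Return all table rows found in an HTML string."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = parse_rows(SAMPLE_HTML)
header, bricks = rows[0], rows[1:]
for brick in bricks:
    # Pair each cell with its column name, e.g. Part -> 3001.
    print(dict(zip(header, brick)))
```

To run this against a live page, the inline string would be replaced by the downloaded HTML, for example with `urllib.request.urlopen` from the standard library. Checking the parsed rows against what Chrome shows is a quick way to catch a scraper that silently reads the wrong elements.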

Part 4

Let’s go to the first question!

Appendix

The saved full ChatGPT log

The slides of the presentation.

Versions used: