Dataharvest 2024
I give you a URL and you give me back LEGO bricks - OR - How to use an LLM to write your first scraper
Workshop at Dataharvest 2024, 2024/Jun/01
Participants will write a simple web scraper in Python. A Large Language Model (LLM) will be used as a coding assistant to write the scraper’s skeleton and help participants overcome obstacles. Participants will also learn that LLMs can’t be trusted blindly: judgment and domain expertise are needed to navigate the LLM’s answers. After this session, participants will be able to implement a very basic web scraper in Python, use an LLM as a coding assistant, and understand its limitations.
No coding skills are required. (Participants should be comfortable copy-pasting Python code even if they don’t understand 100% of what it’s doing.) A very basic (even if abstract) understanding of what a web scraper does will help participants follow the workshop flow. We will be using Google Colab (a web tool) to write and execute Python; no additional software is needed on your computer.
Part 1
We briefly discussed LLMs: why they are interesting, what they are good at, what they are bad at, and what makes them interesting as a teaching assistant for programming.
Part 2
Set up your development environment
If you want to avoid installing Python locally:
- Go to Google Colab
- Sign in with your Google account. This is required to run the code.
Or install Python locally.
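Whichever option you pick, a quick sanity check confirms that Python actually runs. The snippet below is a minimal sketch: the helper name `python_version_ok` and the minimum version of 3.8 are assumptions made for illustration, not requirements stated by the workshop.

```python
import sys


def python_version_ok(minimum=(3, 8)):
    """Return True if the running interpreter is at least `minimum`.

    The default of (3, 8) is an assumed threshold, not a workshop requirement.
    """
    return sys.version_info >= minimum


# Show which interpreter is running (in Colab, this is the hosted one).
print(sys.version)
print("Ready to go" if python_version_ok() else "Consider upgrading Python")
```

Paste it into a Colab cell or a local Python file and run it. If a version string is printed, the environment is working.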
Part 3
In a separate Chrome tab, open the page we want to scrape. It’s a list of LEGO bricks. The page is static and runs no JavaScript.
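Because the page is static, everything we want is already in the HTML the server sends, so a scraper only has to download that HTML and pick out the right elements. Here is a rough sketch of the idea using only Python's standard library. The sample markup, the class `ListItemParser`, and the function `extract_items` are all made up for illustration: the real page's HTML will look different, and the scraper you build with the LLM may well use other libraries.

```python
from html.parser import HTMLParser

# Hypothetical sample markup standing in for the real page,
# whose actual structure is not reproduced here.
SAMPLE_HTML = """
<html><body>
  <h1>LEGO bricks</h1>
  <ul>
    <li>Brick 2 x 4</li>
    <li>Plate 1 x 2</li>
    <li>Tile 2 x 2</li>
  </ul>
</body></html>
"""


class ListItemParser(HTMLParser):
    """Collect the text content of every <li> element."""

    def __init__(self):
        super().__init__()
        self.items = []        # finished list items
        self._in_li = False    # are we currently inside an <li>?
        self._buffer = []      # text pieces of the current <li>

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_li = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_li:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "li" and self._in_li:
            self.items.append("".join(self._buffer).strip())
            self._in_li = False


def extract_items(html):
    """Return the text of each <li> found in the given HTML string."""
    parser = ListItemParser()
    parser.feed(html)
    return parser.items


bricks = extract_items(SAMPLE_HTML)
print(bricks)  # ['Brick 2 x 4', 'Plate 1 x 2', 'Tile 2 x 2']
```

To run the same logic on a live page, the HTML string could be downloaded first, for example with `urllib.request.urlopen(url).read().decode()`, where `url` is the address of the page open in your Chrome tab. Which tags to look for depends on the real page, which is exactly the kind of detail to check yourself rather than take from the LLM on trust.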
Part 4
Let’s go to the first question!
Appendix
The saved full ChatGPT log
The slides of the presentation.
Versions used:
- ChatGPT: GPT‑4o (Free)
- Chrome: 124