Vangelis Katsikaros

Question (Step) 16

Instead of asking the LLM another question, I returned to question 13, where the output was good enough. This is the HTML that interests us:

<td rowspan="2" width="300px" class="pspItemClick">
    <a href="/v2/catalog/catalogitem.page?P=3005&amp;name=Brick%201%20x%201&amp;category=%5BBrick%5D" class="pspItemNameLink">Brick 1 x 1</a>
    <br>
    <span class="pspItemCateAndNo">
        <span class="blcatList">
            <a class="_blcatLink" onclick="" href="//www.bricklink.com/catalogList.asp?catType=P&amp;catString=5">Brick</a>
        </span> : 3005
    </span>
    <span class="pspPCC">
    </span>
</td>

At this stage I decided not to ask the LLM for help. I have learned enough about code that selects elements, and I am confident I can copy, paste, and modify code to do what I want. The only addition to what we have asked for so far is text.replace('Brick : ', ''), which removes the category prefix from the text and leaves only the item code. For that, you can either ask the LLM or read the Python documentation, whichever is easier for you!
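
To see what the replace call does in isolation, here is a minimal sketch. It assumes the span's text reaches the script as "Brick : 3005"; the "Plate : 3024" string and the extract_code helper are hypothetical, added only to illustrate what happens when the prefix is different.

```python
# Text of the span as the script is assumed to see it (taken from the HTML above).
text = "Brick : 3005"

# str.replace returns a new string; the original is left unchanged.
code = text.replace("Brick : ", "")
print(code)

# If the prefix does not occur, replace() returns the text as it was.
# "Plate : 3024" is a made-up example, not data from the page.
unchanged = "Plate : 3024".replace("Brick : ", "")
print(unchanged)


def extract_code(text):
    """Hypothetical alternative: keep whatever follows the last colon."""
    return text.rsplit(":", 1)[-1].strip()


print(extract_code("Plate : 3024"))
```

The replace version is enough for this page, where every item is in the Brick category. The split version is only worth the extra line if you later scrape a page with other categories.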

import requests
from bs4 import BeautifulSoup
import csv
import io

url = 'https://vkatsikaros.github.io/dataharvest24-www.github.io/'
response = requests.get(url)

if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')
    div = soup.find('div', id='_idItemTableForP')
    if div is not None:
        table = div.find('table')

        if table is not None:

            rows = []
            tbody = table.find('tbody')
            if tbody is not None:
                row_index = 0
                for tr in tbody.find_all('tr'):
                    if row_index % 2 == 1:
                        row_data = []
                        for td in tr.find_all('td'):
                            span_tag = td.find('span')
                            if span_tag is not None:
                                text = span_tag.text.strip()
                                code = text.replace('Brick : ', '')
                                row_data.append(code)

                            # Find the <a> tag within the <td>
                            a_tag = td.find('a')
                            if a_tag is not None:
                                # Named href so the page-level url variable is not overwritten
                                href = a_tag.get('href', None)  # Use get to avoid KeyError
                                text = a_tag.text.strip()
                                row_data.append(href)
                                row_data.append(text)
                            else:
                                row_data.append(td.text.strip())
                        # Keep only the item code, the link and the item name
                        rows.append(row_data[4:7])
                    row_index += 1
            else:
                print("Tbody not found in the table.")

            output = io.StringIO()
            csv_writer = csv.writer(output)
            
            for row in rows:
                csv_writer.writerow(row)

            csv_content = output.getvalue()
            print(csv_content)
        else:
            print("Table not found within the div. Check the structure.")
    else:
        print("Div with id '_idItemTableForP' not found. Check the id.")
else:
    print('Failed to retrieve the webpage. Status code:', response.status_code)

The diff:

if response.status_code == 200:
             if tbody is not None:
                 row_index = 0
                 for tr in tbody.find_all('tr'):
-                    if row_index % 2 == 0:
+                    if row_index % 2 == 1:
                         row_data = []
                         for td in tr.find_all('td'):
+                            span_tag = td.find('span')
+                            if span_tag is not None:
+                                text = span_tag.text.strip()
+                                code = text.replace('Brick : ', '')
+                                row_data.append(code)
+
                             # Find the <a> tag within the <td>
                             a_tag = td.find('a')
                             if a_tag is not None:
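
The first change in the diff switches from the even-numbered rows to the odd-numbered ones. A small sketch of that selection, using placeholder strings instead of real <tr> elements (the list contents are made up for illustration):

```python
# Placeholder stand-ins for the <tr> elements returned by tbody.find_all('tr').
rows = ["row 0", "row 1", "row 2", "row 3", "row 4"]

# Same test as in the script: keep a row when its index is odd.
odd_rows = []
row_index = 0
for row in rows:
    if row_index % 2 == 1:
        odd_rows.append(row)
    row_index += 1
print(odd_rows)

# enumerate() supplies the index without a manual counter.
odd_rows_enumerate = [row for index, row in enumerate(rows) if index % 2 == 1]

# Slicing with a step of 2, starting at index 1, selects the same rows.
odd_rows_slice = rows[1::2]
print(odd_rows_enumerate == odd_rows_slice)
```

All three forms select the same rows, so which one to use is a matter of taste. The manual counter is kept in the script because that is what the LLM produced in the earlier steps.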

Output

3005,/v2/catalog/catalogitem.page?P=3005&name=Brick%201%20x%201&category=%5BBrick%5D,Brick 1 x 1
3004,/v2/catalog/catalogitem.page?P=3004&name=Brick%201%20x%202&category=%5BBrick%5D,Brick 1 x 2
3622,/v2/catalog/catalogitem.page?P=3622&name=Brick%201%20x%203&category=%5BBrick%5D,Brick 1 x 3
3010,/v2/catalog/catalogitem.page?P=3010&name=Brick%201%20x%204&category=%5BBrick%5D,Brick 1 x 4
3009,/v2/catalog/catalogitem.page?P=3009&name=Brick%201%20x%206&category=%5BBrick%5D,Brick 1 x 6
3008,/v2/catalog/catalogitem.page?P=3008&name=Brick%201%20x%208&category=%5BBrick%5D,Brick 1 x 8
3003,/v2/catalog/catalogitem.page?P=3003&name=Brick%202%20x%202&category=%5BBrick%5D,Brick 2 x 2
3002,/v2/catalog/catalogitem.page?P=3002&name=Brick%202%20x%203&category=%5BBrick%5D,Brick 2 x 3
3001,/v2/catalog/catalogitem.page?P=3001&name=Brick%202%20x%204&category=%5BBrick%5D,Brick 2 x 4
2456,/v2/catalog/catalogitem.page?P=2456&name=Brick%202%20x%206&category=%5BBrick%5D,Brick 2 x 6
3007,/v2/catalog/catalogitem.page?P=3007&name=Brick%202%20x%208&category=%5BBrick%5D,Brick 2 x 8
2356,/v2/catalog/catalogitem.page?P=2356&name=Brick%204%20x%206&category=%5BBrick%5D,Brick 4 x 6
4201,/v2/catalog/catalogitem.page?P=4201&name=Brick%208%20x%208&category=%5BBrick%5D,Brick 8 x 8
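
The links in the second column are relative, so they cannot be opened on their own. A possible next step, sketched below, is to turn them into full URLs with the standard library. The base address https://www.bricklink.com/ is an assumption taken from the category link in the HTML above; it is not something the script itself knows.

```python
from urllib.parse import urljoin, urlsplit, parse_qs

# Assumed base address, inferred from the category link in the HTML snippet.
BASE_URL = "https://www.bricklink.com/"

# First link from the output above.
relative_link = "/v2/catalog/catalogitem.page?P=3005&name=Brick%201%20x%201&category=%5BBrick%5D"

# urljoin combines the base and the relative path and keeps the query string.
full_link = urljoin(BASE_URL, relative_link)
print(full_link)

# The query string also carries the item code and name; parse_qs decodes
# the percent-escapes and returns a list of values per key.
query = parse_qs(urlsplit(full_link).query)
print(query["P"][0])
print(query["name"][0])
```

This also shows that the item code is available in the link itself (the P parameter), which could serve as a cross-check for the code scraped from the span.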