UTN FRSN NEWS Project

What is it?

This is a personal project for practicing robust web scraping and automation, working with task queues and SQL databases, and learning about Cloudflare Workers and its ecosystem of tools.

How does it work?

The project consists of several scripts responsible for extracting and publishing the latest faculty news to a Telegram channel.

What is the source?

The source of the news is the official website of the Facultad Regional San Nicolás of the Universidad Tecnológica Nacional, specifically the news section found at the following link: https://www.frsn.utn.edu.ar/?paged=1&page_id=80

Where can I see it?

On the Telegram channel: https://t.me/utnfrsnnews

What technologies does it use?

It uses Cloudflare Workers to host the scripts, Cloudflare D1 as the SQL database, Cloudflare Queues to manage tasks, and Cloudflare Images to store news images.

How is the project structured?

The project consists of four main applications: the Index Scraper, the Main Scraper, the Messenger, and the Webpage. Each has a specific role in the news extraction and publication process.

Index Scraper

Detects new news items and extracts their URLs

Runs at minute 0 of every hour.

Its workflow starts by fetching the URL of the most recent news item stored in the database. On the first run there is none, so None is stored instead.

Then it retrieves the latest news from the faculty site, parses it, and checks whether the URL from the database appears among the 5 most recent items.

If so, it stops at that point and filters out any items that already exist in the DB. If any items remain after filtering, it inserts the URLs found on that page (in ascending chronological order) into the utn-frsn-news-scraper queue so that the Main Scraper can process them and retrieve the full information.

If the latest DB URL is not among the first 5 items, it continues with the next 4 pages and repeats the check, this time over the entire news list except the 5 oldest items. The 5-item margin guards against gaps: the source may occasionally publish an item dated before the latest one, which we treat as an edge case, so we assume a late item can appear up to 5 positions behind the most recent ones.

If it stops at any point, it performs the analysis described above: it filters out all news URLs that already exist in the DB and adds the new ones to the utn-frsn-news-scraper queue.

If it doesn't stop, it keeps fetching batches of 5 pages until it reaches the last page of news results. Only then does it make the final cutoff and proceed to filtering and insertion into the queue.
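The cutoff logic above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name collect_new_urls, the fetch_page and is_in_db callbacks, and the choice to apply the 5-item margin to the accumulated list are all assumptions made for the example.

```python
from typing import Callable, List, Optional

MARGIN = 5       # positions of slack for items published out of order
BATCH_PAGES = 5  # pages per batch


def collect_new_urls(
    fetch_page: Callable[[int], List[str]],  # hypothetical: page number -> URLs, newest first
    last_page: int,
    last_known_url: Optional[str],           # None on the very first run
    is_in_db: Callable[[str], bool],         # hypothetical DB lookup
) -> List[str]:
    """Return the URLs to enqueue, oldest first."""
    seen = list(fetch_page(1))
    # First check: only the MARGIN most recent items of page 1.
    found = last_known_url in seen[:MARGIN]
    page = 1
    while not found and page < last_page:
        # After page 1, fetch 4 more pages to complete the first batch of 5,
        # then keep going in batches of 5 pages.
        extra = BATCH_PAGES - 1 if page == 1 else BATCH_PAGES
        batch_end = min(last_page, page + extra)
        for number in range(page + 1, batch_end + 1):
            seen.extend(fetch_page(number))
        page = batch_end
        # Later checks: everything fetched so far except the MARGIN oldest items.
        found = last_known_url in seen[: max(len(seen) - MARGIN, 0)]
    # Whether we stopped early or reached the last page, drop what the DB
    # already has and return the rest in ascending chronological order.
    new_urls = [url for url in seen if not is_in_db(url)]
    new_urls.reverse()
    return new_urls
```

On a first run (last_known_url is None) the loop never finds a match, so it walks to the last page and returns every URL, oldest first.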

Main Scraper

Extracts the full information of each news item from its URL

Runs a single instance at a time, triggered by the utn-frsn-news-scraper queue.

When there are tasks in the queue, Cloudflare spins up a single worker to process all pending tasks. Whether to process them sequentially or in parallel is up to the developer, which is a great advantage because this project depends on news being saved and published in sequential order.

Once scraping is finished, it saves the news image to Cloudflare Images and stores all the news information in the SQL DB (Cloudflare D1).

Finally, it inserts a task into the utn-frsn-news-messenger work queue for the Messenger to send the news via Telegram.
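The sequential processing described above can be sketched in plain asyncio. This is an illustration only: handle_batch and scrape_and_store are hypothetical names, the batch is modeled as a list of dicts carrying a news_url field, and the real Cloudflare queue handler is not shown.

```python
import asyncio
from typing import Dict, List


async def scrape_and_store(task: Dict[str, str], stored: List[str]) -> None:
    # Placeholder for the real work: fetch and parse the page, upload the
    # image, insert the row into the DB, and enqueue a Messenger task.
    await asyncio.sleep(0)
    stored.append(task["news_url"])


async def handle_batch(batch: List[Dict[str, str]]) -> List[str]:
    """Process the tasks one at a time so storage order matches queue order."""
    stored: List[str] = []
    for task in batch:
        # Awaiting each task before starting the next keeps the order.
        # Using asyncio.gather here instead would run them concurrently
        # and no longer guarantee the order of the side effects.
        await scrape_and_store(task, stored)
    return stored
```

For example, asyncio.run(handle_batch(tasks)) returns the URLs in the same order in which the tasks were enqueued.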

Messenger

Sends the news via Telegram

Runs a single instance at a time, triggered by the utn-frsn-news-messenger queue.

As in the previous workflow, Cloudflare spins up a single worker to process all pending tasks in the work queue. This is very convenient for processing each pending messaging task sequentially and sending the news via Telegram in the correct chronological order.

This application's workflow is very simple: for each task, it fetches the news information from the DB by ID, formats the message, and sends it via Telegram. It sends the news image first, then the title and content in a separate message. This is because the "caption" of an image in Telegram has a relatively low character limit compared to the length of a typical news item. Some news items even exceed Telegram's character limit for a text message, so the content is split every certain number of characters and sent as multiple messages to avoid losing information. This happens very rarely, and at most the content has to be sent in two messages.
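The splitting step can be sketched as below. Telegram's Bot API limits a text message to 4096 characters and a photo caption to 1024. The function name split_message is hypothetical, and preferring to cut at whitespace is an addition made for this example; the description above only says the content is cut every certain number of characters.

```python
from typing import List

TELEGRAM_TEXT_LIMIT = 4096  # maximum characters in one Telegram text message


def split_message(text: str, limit: int = TELEGRAM_TEXT_LIMIT) -> List[str]:
    """Split text into chunks of at most `limit` characters."""
    chunks: List[str] = []
    while len(text) > limit:
        # Prefer the last newline or space that keeps the chunk within the limit.
        cut = max(text.rfind("\n", 0, limit + 1), text.rfind(" ", 0, limit + 1))
        if cut <= 0:
            cut = limit  # no usable whitespace: fall back to a hard cut
        chunk = text[:cut].rstrip()
        if chunk:
            chunks.append(chunk)
        text = text[cut:].lstrip()
    if text:
        chunks.append(text)
    return chunks
```

Each returned chunk can then be sent as its own message, in order, after the image.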

What is the database structure?

The database consists of a single table called "news" that stores all the information related to each news item, including its URL, title, content, and creation date, among other fields.

Field	Type	Description
id	INTEGER	Unique identifier for each news item
url	VARCHAR(511)	Unique URL identifying this news
title	VARCHAR(511)	News title
content	TEXT	News content
photo_id	VARCHAR(36)	Image ID in Cloudflare Images
response_elapsed_seconds	REAL	Seconds the faculty server took to complete the HTTP request
parse_elapsed_seconds	REAL	Seconds taken to parse the HTML
origin_created_at	TEXT	Date and time of news creation by the source
indexed_at	TEXT	Time when the Index Scraper detected this news
inserted_at	TEXT	Time when the Main Scraper inserted this news
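As a rough sketch, the table above could be declared like this. D1 is built on SQLite, so the example uses Python's sqlite3 module with an in-memory database. The PRIMARY KEY and UNIQUE constraints are inferred from the field descriptions, and ordering by id to find the latest item is an assumption; the project's real DDL and queries may differ.

```python
import sqlite3
from typing import Optional

# Column names and types follow the table above; constraints are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS news (
    id INTEGER PRIMARY KEY,
    url VARCHAR(511) UNIQUE,
    title VARCHAR(511),
    content TEXT,
    photo_id VARCHAR(36),
    response_elapsed_seconds REAL,
    parse_elapsed_seconds REAL,
    origin_created_at TEXT,
    indexed_at TEXT,
    inserted_at TEXT
)
"""


def create_db() -> sqlite3.Connection:
    """Create an in-memory database with the news table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def latest_url(conn: sqlite3.Connection) -> Optional[str]:
    """Return the URL of the most recently inserted item, or None if empty."""
    row = conn.execute("SELECT url FROM news ORDER BY id DESC LIMIT 1").fetchone()
    return row[0] if row else None
```

On an empty table latest_url returns None, which matches the first-run case described for the Index Scraper.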

What is the structure of the work queues?

The project uses two work queues to manage scraping tasks and message sending via Telegram.

The first queue, called utn-frsn-news-scraper, stores the URLs of the news items that the Index Scraper detects as new and that the Main Scraper must process to extract each item's full information.

Field	Type	Description
news_url	string	URL of the news to process
photo_url	string	URL of the news photo to process
inserted_at	string	Date of task insertion into the queue (news indexing date)

The second queue, called utn-frsn-news-messenger, stores the IDs of news processed by the Main Scraper that the Messenger must send via Telegram.

Field	Type	Description
news_id	integer	Internal ID of the news to process
inserted_at	string	Date of task insertion into the queue (news insertion date)
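The two message shapes above can be written down as typed dictionaries. This is a sketch under assumptions: the class and helper names are hypothetical, and the use of ISO 8601 UTC strings for inserted_at is a guess, since the tables only state that the field is a string.

```python
import json
from datetime import datetime, timezone
from typing import TypedDict


class ScraperMessage(TypedDict):
    """Payload for the utn-frsn-news-scraper queue."""
    news_url: str
    photo_url: str
    inserted_at: str


class MessengerMessage(TypedDict):
    """Payload for the utn-frsn-news-messenger queue."""
    news_id: int
    inserted_at: str


def now_iso() -> str:
    # Assumed timestamp format: ISO 8601 in UTC.
    return datetime.now(timezone.utc).isoformat()


def make_scraper_message(news_url: str, photo_url: str) -> ScraperMessage:
    return {"news_url": news_url, "photo_url": photo_url, "inserted_at": now_iso()}


def make_messenger_message(news_id: int) -> MessengerMessage:
    return {"news_id": news_id, "inserted_at": now_iso()}
```

Both payloads are plain dictionaries, so they serialize directly with json.dumps.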

What infrastructure is used?

Currently, the entire project (that is, the 4 apps) is hosted on Cloudflare Workers, using Cloudflare D1 as the SQL database, Cloudflare Queues to manage tasks, and Cloudflare Images to store news images.

It uses the Workers Paid plan and the 100k Images plan (also paid).

The services used and their documentation are listed below:

Cloudflare Workers: https://workers.cloudflare.com/
Cloudflare D1 (main database): https://developers.cloudflare.com/d1/
Cloudflare Images: https://developers.cloudflare.com/images/
Cloudflare Queues: https://developers.cloudflare.com/queues/

Has the same infrastructure always been used?

No, the project went through several infrastructure migrations before reaching the current one on Cloudflare Workers. This was due to various factors, such as changes in the free plans of the services used, the need for greater robustness and availability, and the search for better integration between the tools used.

It started on Heroku with MongoDB Atlas, then migrated to GCP (while keeping MongoDB Atlas), and finally moved to Cloudflare Workers.

I learned a lot from each of those stages, and even more from the migrations.

Heroku

What I can say about Heroku is that it was a good platform to start with, especially due to its ease of use and ecosystem of tools (which at the time were free).

Scheduler and Task were used to schedule and run the scraping and messaging/notification applications.

On November 28, 2022, Heroku removed its free plans, which made it necessary to migrate to another platform to keep the project's costs low.

Project repository at the time Heroku was used

Google Cloud Platform (GCP)

At the time of migration, GCP met all the project's needs and had a quite generous free tier for what was required.

The Pub/Sub service was used to create topics and subscribe the applications (Index Scraper, Main Scraper, and Messenger) to them.

Cloud Scheduler was used to schedule the insertion of a message into the topic that triggered the Index Scraper instance.

Cloud Functions handled instantiating the applications (Index Scraper, Main Scraper, and Messenger) upon receiving a message in the corresponding topic.

We continued using MongoDB as the main database and to store queue records and results. Using SQL on GCP was too expensive for our usage, so we kept the MongoDB Atlas free tier.

Services were managed mainly through shell scripts, and the GCP Console also helped a lot.

Project repository at the time GCP was used

Cloudflare

What we use today on Cloudflare was explained in the previous section.

Two major steps were taken with this new infrastructure. First, Cloudflare D1 replaced MongoDB Atlas as the main database. D1 offers a SQL solution integrated with the Cloudflare Workers ecosystem, which simplifies project management and development, and it also has a fairly generous free tier.

Second, a full-stack web application was deployed to display the news on a simple webpage, using FastAPI in the backend and Jinja2+TailwindCSS for template rendering.

All of this runs on one of the world's largest and most robust networks, with highly available and scalable infrastructure.

Services are configured through a JSON file in the project called wrangler.jsonc, which makes managing and deploying the applications easier.

Deployment is very simple: a single command, uv run pywrangler deploy, or even better, automatic deployment via GitHub Actions (which is configured in this repository).

As soon as I started migrating the code, I found that Cloudflare promotes uv for managing Python projects in Workers, and honestly, it's a fantastic tool that makes the developer's life much easier.

Nevertheless, migrating from GCP to Cloudflare was a huge challenge. Much of the code had to be refactored to be asynchronous and to run on Pyodide (Cloudflare Workers does not support Python natively, but runs it on WebAssembly via Pyodide), and all the data had to be transferred from MongoDB Atlas to Cloudflare D1 (SQL).

This challenge, which took about 3 days of intense effort and learning, is documented in GitHub Issue #3 of the repository:

Refactor + Migration from GCP to Cloudflare Workers

Where can I see the source code?

The project's source code is available on GitHub at the following link:

Project Repository

Who developed it?

The project was developed by Goran Prpic, an Argentine developer passionate about technology and continuous learning.

You can contact me or see more of my projects at the following links:

GitHub: https://github.com/gorandp
LinkedIn: https://linkedin.com/in/gorandp
Telegram: https://t.me/gorandp