Clustering and summarization of news articles

PJATK Repository, Faculty of Information Technology, Engineering thesis: Intelligent data processing systems, 2022

dc.contributor.author	Darnowski, Mateusz
dc.date.accessioned	2023-02-23T13:40:36Z
dc.date.available	2023-02-23T13:40:36Z
dc.date.issued	2023-02-23
dc.identifier.issn	2022/I/D/20
dc.identifier.uri	https://repin.pjwstk.edu.pl/xmlui/handle/186319/2532
dc.description.abstract	The proposed solution often fails to identify specific important events effectively; however, the model accurately describes the internal debate conducted in the processed set of articles. Newspapers contain arguments and support contrasting philosophies. The model-generated summaries cite and copy common perspectives disseminated in the documents. The universality of the quoted opinions could only be evaluated by extending the processed dataset with records from more diverse data sources. The model behaves correctly when reviewing a small set of related papers and identifies unique sub-topics appearing in the dataset. Topic summaries differ distinguishably in content and context, even if the topics discovered by the model share similar keywords. Increasing the number of articles to be summarized results in more diverse topics; nevertheless, the resulting summaries include more quotes and opinions. Complex and lengthy sentences heavily influence the results of similarity measurements and document sorting. The model favors long sentences with varied vocabulary that can easily be linked to numerous discussions, which can hurt the ranking of short, condensed sentences that should be valued in a summary. A sentiment analysis study and a reconsideration of the model's assumptions could help in producing more condensed and less opinion-focused summaries. However, the presented solution successfully captures and outlines the background of the conversations conducted in the analyzed reports and introduces arguments that would be omitted if long sentences and opinions were excluded from processing and summarization. The model provides coherent and compelling topics and offers interpretations relevant to public discourse. The use of Gibbs sampling effectively improves the accuracy of LDA topic-cluster labeling. The proposed paragraph extraction method successfully removes noise and simplifies topic modeling assignments. The approach proposed in Section 6.3.4, "Ranking paragraphs on a topic", outperforms TextRank considerably when handling a set of paragraphs; sentences added to summaries are more coherently linked and describe the issues identified in the corpus more clearly. The in-depth study and reuse of LDA likelihood estimates provide convincing overviews of the discussions initiated in the analyzed datasets.	pl_PL
dc.language.iso	en	pl_PL
dc.relation.ispartofseries	;Nr 7087
dc.subject	Computer science	pl_PL
dc.subject	Intelligent systems	pl_PL
dc.title	Clustering and summarization of news articles	pl_PL
dc.type	Thesis	pl_PL