Učni načrt predmeta / Course syllabus

Predmet:
Jezikovne tehnologije
Course:
Language Technologies
Študijski program in stopnja / Study programme and level:
Informacijske in komunikacijske tehnologije, 2. stopnja / Information and Communication Technologies, 2nd cycle
Študijska smer / Study field:
Tehnologije znanja / Knowledge Technologies
Letnik / Academic year:
1
Semester / Semester:
2
Vrsta predmeta / Course type
Izbirni / Elective
Univerzitetna koda predmeta / University course code:
IKT2-714
Predavanja / Lectures
Seminar / Seminar
Vaje / Tutorial
Klinične vaje / work
Druge oblike študija
Samost. delo / Individ. work
ECTS
30 30 30 210 10

*Navedena porazdelitev ur velja, če je vpisanih vsaj 15 študentov. Drugače se obseg izvedbe kontaktnih ur sorazmerno zmanjša in prenese v samostojno delo. / This distribution of hours applies if at least 15 students are enrolled. Otherwise the contact hours are reduced proportionally and transferred to individual work.

Nosilec predmeta / Course leader:
doc. dr. Senja Pollak
Sodelavci / Lecturers:
Jeziki / Languages:
Predavanja / Lectures:
slovenščina, angleščina / Slovenian, English
Vaje / Tutorial:
Pogoji za vključitev v delo oz. za opravljanje študijskih obveznosti:
Prerequisites:

Zaključen študijski program prve stopnje s področja naravoslovja, tehnike ali računalništva.

Students must have completed a first-cycle study programme in natural sciences, technical disciplines or computer science.

Vsebina:
Content (Syllabus outline):

Uvod:
Razvoj jezikoslovja in računalniškega jezikoslovja, kompleksnost jezika, ravni analize jezika, pregled aplikacij in metod.

Jezikovni korpusi:
Namen, zgodovina in tipologija, označevanje, uporaba, računalniški zapis, konkretni primeri.

Metode računalniške obravnave:
Regularni izrazi, statistične metode, strojno učenje, globoko učenje (modeli arhitekture transformer, generativni modeli).

Področja uporabe:
Iskanje in zajemanje informacij, klasifikacija dokumentov, digitalne knjižnice, itd.

Introduction:
Development of linguistics and computational linguistics, complexity of language, levels of linguistic analysis, overview of applications and methods.

Language corpora:
Purpose, history and typology, annotation, use cases, computer encoding, specific examples.

Methods of computer processing:
Regular expressions and finite state automata, phrase-structure grammars, statistical methods, machine learning, deep learning (Transformer architecture, generative models).

Applications:
Information retrieval and extraction, document classification, digital libraries, etc.

Temeljna literatura in viri / Readings:

Izbrana poglavja iz naslednjih knjig: / Selected chapters from the following books:
- D. Jurafsky and J.H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition with Language Models. Prentice-Hall, 2024. https://web.stanford.edu/~jurafsky/slp3/
- R. Mitkov (ed.). The Oxford Handbook of Computational Linguistics. Oxford University Press, 2003. ISBN 978-0-19-823882-9
- C. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999. ISBN 0-262-13360-1
- N. Ide and J. Pustejovsky (eds.). Handbook of Linguistic Annotation. Springer, 2017. ISBN 978-94-024-0881-2

Cilji in kompetence:
Objectives and competences:

Cilj predmeta je posredovati splošno znanje o jezikovnih tehnologijah, to je metodah in aplikacijah obdelave naravnega jezika na računalniku. Predstavljena je zgodovina in osnovni pojmi jezikoslovja, raznovrstne aplikacije jezikoslovnih tehnologij in računalniške metode, ki se pri njih uporabljajo. Podrobno so obdelani jezikovni korpusi, velike zbirke označenih besedil, ki so osnovna infrastruktura, potrebna za raziskave in obdelavo posameznih jezikov. Obravnavana je tudi analiza jezikovnih korpusov z metodami strojnega učenja. Poudarek predmeta je na obravnavi slovenskega jezika in čezjezikovnih metodah.

Slušatelji pridobijo osnovno teoretično razumevanje in nekaj praktičnih izkušenj s področij jezikovnih tehnologij in korpusnega jezikoslovja, kar je predpogoj za učinkovito delo na računalniški obdelavi jezikovnih podatkov.

The goal of this course is to introduce language technologies, i.e. methods and applications of computer processing of natural language. The course covers the history and basic concepts of linguistics, various applications of language technologies, and the computational methods they use. Particular attention is given to language corpora, large collections of annotated texts that serve as the basic infrastructure necessary for research on and processing of individual languages. The analysis of language corpora with machine learning methods is also discussed. The course focuses on the processing of the Slovene language and on cross-lingual methods.

Students will gain basic theoretical understanding and some practical knowledge of language technologies and corpus linguistics, which is a prerequisite for effective work on computer processing of language data.

Predvideni študijski rezultati:
Intended learning outcomes:

Študenti bodo z uspešno opravljenimi obveznostmi tega predmeta pridobili:
- sposobnost analize, sinteze in predvidevanja rešitev ter posledic
- obvladanje raziskovalnih metod, postopkov in procesov, razvoj kritične in samokritične presoje
- zavezanost profesionalni etiki in regulativi
- poznavanje zgodovine razvoja in razumevanje konceptov računalniškega jezikoslovja
- osnovno poznavanje tradicionalnih in naprednih metod za obdelavo naravnih jezikov
- pregledno znanje aplikacij jezikovnih tehnologij, njihovih lastnosti in omejitev z vidika možne uporabe v praksi, posebej za slovenski jezik
- sposobnost integriranja znanja in obvladovanja kompleksnosti pri reševanju specifičnih problemov v računalniških aplikacijah

Students successfully completing this course will acquire:
- an ability to analyse, synthesise and anticipate solutions and consequences
- mastery of research methods, procedures and processes, and the development of critical and self-critical judgement
- commitment to professional ethics and regulations
- knowledge of the history of computational linguistics and an understanding of its concepts
- basic understanding of traditional and advanced methods for natural language processing
- overview knowledge of language technology applications, their features and limitations with regard to practical use, especially for the Slovene language
- ability to integrate knowledge and handle complexity when solving specific problems in computer applications

Metode poučevanja in učenja:
Learning and teaching methods:

Predavanja, seminar, konzultacije, samostojno delo

Lectures, seminar, consultations, individual work

Načini ocenjevanja / Assessment:
Delež v % / Weight in %
Seminar / Seminar: 50
Ustni izpit / Oral exam: 50
Reference nosilca / Lecturer's references:
1. KOLOSKI, Boshko, STEPIŠNIK PERDIH, Timen, ROBNIK ŠIKONJA, Marko, POLLAK, Senja, ŠKRLJ, Blaž. Knowledge graph informed fake news classification via heterogeneous representation ensembles. Neurocomputing. [Print ed.]. 2022, vol. 496, july, str. 208-226. ISSN 0925-2312. DOI: 10.1016/j.neucom.2022.01.096.
2. MARTINC, Matej, POLLAK, Senja, ROBNIK ŠIKONJA, Marko. Supervised and unsupervised neural approaches to text readability. Computational linguistics. 2021, vol. 47, no. 1, str. 141-179. ISSN 0891-2017. DOI: 10.1162/coli_a_00398
3. ŠKRLJ, Blaž, MARTINC, Matej, KRALJ, Jan, LAVRAČ, Nada, POLLAK, Senja. tax2vec : constructing interpretable features from taxonomies for short text classification. Computer speech & language. 2021, vol. 65, str. 101104-1-101104-21. ISSN 0885-2308. DOI: 10.1016/j.csl.2020.101104
4. MARTINC, Matej, HAIDER, Fasih, POLLAK, Senja, LUZ, Saturnino. Temporal integration of text transcripts and acoustic features for Alzheimer's diagnosis based on spontaneous speech. Frontiers in aging neuroscience. 2021, vol. 13, str. 652647-1-652647-15. ISSN 1663-4365. DOI: 10.3389/fnagi.2021.642647.
5. HONG HANH, Tran Thi, MARTINC, Matej, REPAR, Andraž, LJUBEŠIĆ, Nikola, DOUCET, Antoine, POLLAK, Senja. Can cross-domain term extraction benefit from cross-lingual transfer and nested term labeling?. Machine learning. 2024, vol. 113, march, str. 4285-4314. ISSN 1573-0565. DOI: 10.1007/s10994-023-06506-7.