
Hello! My name is Rebecca, and I am a second-year Computer Science Master’s student at UCLA as a DeepMind Fellow. I’m currently doing research in NLP, specifically on under-resourced languages and code-switching datasets, with Prof. Peng’s Plus Lab as well as with Prof. Torres Cacoullos at Penn State. I received my B.S. with Honors in Computer Science from Stanford University, where I wrote my thesis under the advisement of Prof. Christopher Manning.

Broadly, I like building language technologies that help communities maintain their language practices. Hobbies include writing, dancing flamenco, and collecting earrings!

Research

Publications

Pattichis, R., LaCasse, D., Trawick, S., & Torres Cacoullos, R. (2023, December). Code-Switching Metrics Using Intonation Units. In The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Abstract Code-switching (CS) metrics in NLP that are based on word-level units are misaligned with true bilingual CS behavior. Crucially, CS is not equally likely between any two words, but follows syntactic and prosodic rules. We adapt two metrics, multilinguality and CS probability, and apply them to transcribed bilingual speech, putting forward Intonation Units (IUs), prosodic speech segments, as basic tokens for NLP tasks. In addition, we calculate these two metrics separately for distinct types of CS: alternating-language multi-word strings and single-word incorporations from one language into another. Results indicate that individual differences according to the two CS metrics are independent, visualized in number and breadth of language bands. However, there is a shared tendency among bilinguals for CS to occur across, rather than within, IU boundaries. That is, bilinguals tend to prosodically separate their two languages. This constraint is blurred when metric calculations do not distinguish multi-word and single-word items. These results call for a reconsideration of units of analysis in future development of CS datasets for NLP tasks.
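To make the choice of unit concrete, here is a minimal sketch (not the paper’s implementation) of how a switch-probability metric changes depending on whether the basic token is the word or the IU; the toy transcript and language tags below are invented for illustration:

```python
# Sketch: code-switching probability under two tokenizations.
# A transcript is a list of Intonation Units (IUs); each IU is a list of
# (word, language) pairs. All data here is illustrative.

transcript = [
    [("yo", "spa"), ("creo", "spa"), ("que", "spa")],
    [("it's", "eng"), ("fine", "eng")],
    [("pero", "spa"), ("anyway", "eng")],  # a switch *within* an IU
]

def switch_prob(labels):
    """P(switch) = switches / adjacent pairs in a flat label sequence."""
    pairs = list(zip(labels, labels[1:]))
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)

# Word-level view: every adjacent word pair is a potential switch point.
word_labels = [lang for iu in transcript for _, lang in iu]

# IU-level view: label each IU by its majority language; switches are
# then counted across IU boundaries only.
def majority(iu):
    langs = [lang for _, lang in iu]
    return max(set(langs), key=langs.count)

iu_labels = [majority(iu) for iu in transcript]

print(f"word-level P(switch): {switch_prob(word_labels):.2f}")
print(f"IU-level   P(switch): {switch_prob(iu_labels):.2f}")
```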

Pattichis, R., Trawick, S., LaCasse, D., & Torres Cacoullos, R. (2023, July). [SRW] Aligning Code-Switching Metrics with Bilingual Behavior. In The 61st Annual Meeting Of The Association For Computational Linguistics (ACL). (poster)

Abstract Models and metrics of linguistic code-switching (CS) have almost exclusively worked with word-level units. However, any two words are not equally likely CS points in bilingual speech. In addition, other-language single-word items and alternating-language multi-word items have distinct properties. Adapting these familiar metrics to the Intonation Unit (IU), we capture a shared tendency for CS to occur across rather than within prosodic boundaries. This constraint is distorted when single- and multi-word other-language items are merged. Individual differences according to language distribution and CS rates are independent, visualized in the number and breadth of language bands in transcripts of bilingual speech. These results are important to consider in future development of code-switched datasets for NLP tasks, as the IU token and the exclusion/inclusion of single-word items highly impact the CS represented in the input text.

Alvero, A., & Pattichis, R. (2022). “Linguistic and Cultural Strategies: Identification and Analysis of Spanish Language Usage in College Admissions Essays”. (Under review) (preprint)

Abstract In US K-12 education, the Spanish language is subject to practices and policies that limit its expression, especially among Latinx students. However, Spanish is seen as a positive form of diversity in higher education. In light of these contradictions, we examine the degree to which Spanish is strategically deployed in selective college admissions by high school students in their admissions essays. We use two years of undergraduate application essays (n = 276,768) and metadata submitted to the University of California by every self-identified Latinx applicant and a racially representative random sample of non-Latinx applicants. To identify Spanish language usage in the text, we develop a computational mixed methods approach by combining machine translation and human reading. Spanish was used by 33% of Latinx and 15% of non-Latinx students, with stylistic variation by class and ethnicity. We also find that lower-income Mexican and Central American applicants were the most likely to use substantive forms of Spanish in their admissions essays as well as provide translations into English. We posit this as an example of students identifying cultural mismatch between themselves and university admissions offices due to the perceived need to translate the Spanish words and phrases.
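The paper’s pipeline pairs machine translation with human reading; as a rough first-pass stand-in (a different, simpler technique than the paper’s), one could flag candidate Spanish sentences with an off-the-shelf language identifier such as langdetect. Everything below, including the example essay text, is invented for illustration:

```python
# Sketch: first-pass detection of Spanish sentences in admissions essays.
# Uses the off-the-shelf langdetect package as a stand-in for the paper's
# machine-translation-plus-human-reading pipeline; flagged output would
# still need human review. The example text is invented.
import re
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

essay = (
    "My grandmother always told me: el que no arriesga, no gana. "
    "That phrase shaped how I approached every challenge."
)

def spanish_sentences(text):
    """Return sentences that a language identifier labels as Spanish."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    flagged = []
    for s in sentences:
        try:
            if detect(s) == "es":
                flagged.append(s)
        except LangDetectException:  # too short / no usable features
            pass
    return flagged

print(spanish_sentences(essay))
```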

Pattichis, R. (2022). Centering the Voices of First-Generation Immigrant Youth: Multilingual NLP Methods in the Translanguaging Context. Stanford Digital Repository. Available at https://purl.stanford.edu/nd602zq5759

Abstract Translanguaging, or the act of using multiple languages within a speech utterance (e.g., sentence and/or word), is a global phenomenon for multilingual communities. In the context of the United States, translanguaging is a frequent occurrence among Latin American immigrant communities. While there are several large multilingual models such as XLM-RoBERTa and multilingual BERT, these models have been trained on and evaluated with parallel monolingual data. Upholding parallel monolingualism as the standard definition of multilingualism erases the language practices of many communities of color, including Latin American immigrants in the United States. The consequences are even worse for racialized children in the schooling system who may be labeled as English Language Learners (ELL) for the very notion that their fluency in multiple languages must be separate and apart. This ELL label has immediate consequences regarding future classes they have access to, as well as their own sentiment around and through their language practices. Moreover, there is currently no labeled NLP dataset that includes translanguaging between Spanish and English for the task of sentiment analysis. In collaboration with the Stanford Graduate School of Education, this research aims to center the voices of first-generation Indigenous Latin American immigrant students in NLP research through the task of sentiment analysis. Specifically, this thesis constructs the Interview Transcripts Dataset, an innovative trilingual dataset composed of transcribed interview data that contain instances of translanguaging, as well as a framework for developing these datasets. The findings of this project provide a promising starting point and emphasize the need to leverage current pre-trained models on similar domains as well as develop a more robust large-scale dataset that centers translanguaging. Ultimately, translanguaging remains an open problem in NLP research tasks.

Final Class Projects

UCLA

CS269 - Fairness, Accountability, and Robustness in Natural Language Processing: “Towards the Equivalence Constraint: Evaluating Code-Switching Benchmarks on a Different Perspective” (repo)

Abstract Code-switching (CS) is increasingly relevant in the field of NLP with the development of multilingual language models. We evaluate current CS datasets on their multilinguality and switching complexity using previously established metrics, and curate a dataset that aligns with the Equivalence Constraint Theory of CS. Currently, this theory is absent from NLP datasets, even though such data is crucial for studying it further. We perform manual editing and human validation by native English-Spanish speakers. Ultimately, we find that data are either entirely monolingual or present a skewed perspective of CS patterns (i.e., single-word switches). These findings hold implications for the future collection of CS datasets for NLP.

CS 260 - Machine Learning Algorithms: “Lyric Generation Based on Model Complexity and Repetition Evaluation” (repo)

Stanford

CS 224N - NLP with Deep Learning: “RobustQA Using Data Augmentation”

Abstract This project aims to explore possible improvements and extensions to the RobustQA Default baseline provided by the CS224N Winter quarter staff. Our goal is to create a domain-agnostic question answering system given DistilBERT as a pre-trained transformer model. The main method attempted in this paper is Task-Adaptive Pre-Training (TAPT) [1], which entails a pre-training step utilizing the Masked Language Modeling task. This method was combined with experimentation on hyperparameters (batch size, number of epochs, and learning rate) to produce the highest-achieving model. Specifically, a pre-trained MLM model with a batch size of 32 yielded an EM of 42.75 and F1 of 61.14, which are each around 2 points higher than the baseline metrics.
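For the curious, a minimal sketch of what a TAPT-style MLM step can look like with the Hugging Face transformers library; the data file, batch size, and epoch count below are illustrative rather than the project’s actual configuration:

```python
# Sketch: task-adaptive pre-training (TAPT) of DistilBERT via masked
# language modeling with Hugging Face transformers. The data path and
# hyperparameters are illustrative, not the project's actual setup.
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")

# Unlabeled text drawn from the target QA domains (hypothetical file).
dataset = load_dataset("text", data_files={"train": "target_domain.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens; the model learns to reconstruct them.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="tapt-distilbert",
        per_device_train_batch_size=32,
        num_train_epochs=3,
    ),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
# The adapted encoder would then be fine-tuned on the downstream QA task.
```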

CS 221 - Artificial Intelligence: Principles and Techniques: “Modeling Platelet Transfusion for The Stanford Blood Center: Inference Using Sentiment Analysis and Recurrent Neural Networks” (poster)

Abstract Platelets are a blood product that expire within 3 days of arriving at the hospital. The Stanford Hospital system wastes about 10% of platelets annually. Researchers previously used aggregated data in order to predict usage, create a three-day ordering strategy, and thus reduce wastage. However, this ordering strategy was not implemented due to lack of human trust in models. New research attempts to address this issue by using patient-level prediction. This project aims to aid this research by predicting which surgeries will need a platelet transfusion. The two methods used for prediction are stochastic gradient descent on bag-of-words features and Recurrent Neural Networks.
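As a minimal sketch of the first method, here is SGD on bag-of-words features with scikit-learn; the surgery descriptions and labels are invented for illustration:

```python
# Sketch: predicting platelet-transfusion need from surgery descriptions
# with bag-of-words features and an SGD-trained linear classifier.
# All data below is invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

surgeries = [
    "coronary artery bypass graft",
    "knee arthroscopy",
    "liver transplant",
    "cataract extraction",
]
needs_platelets = [1, 0, 1, 0]  # 1 = transfusion needed

model = make_pipeline(
    CountVectorizer(),               # bag-of-words features
    SGDClassifier(loss="log_loss"),  # logistic regression trained by SGD
)
model.fit(surgeries, needs_platelets)

print(model.predict(["cardiac valve replacement"]))
```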

CS 129 - Applied Machine Learning: “Music Genre Classification Using MFCCs and Neural Networks” (code)

Abstract We approach the music genre classification problem using the GTZAN dataset, which contains 100 30-second song clips for each of 10 genres. Our first component of the project revolved around computing the Mel Frequency Cepstral Coefficients (MFCCs) and feeding the result into a variety of classification algorithms: KNN, SVM, and a neural network with fully connected layers (FCNN). We also considered an FCNN classifier based on initial code provided online [5]. We then adopted the FCNN as a baseline model and considered several variations that included: i) reducing the difference between the training and validation errors without sacrificing accuracy, ii) reducing the number of layers to reduce the total number of parameters, and iii) considering different activation functions and dropout. Compared to the 47% accuracy achieved by the SVM, we derived reduced-parameter FCNN models that gave validation and test accuracy of 61-62%.
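A minimal sketch of the MFCC-to-FCNN pipeline, using librosa and Keras; the shapes, hyperparameters, and random stand-in data are illustrative, not the project’s actual configuration:

```python
# Sketch: MFCC features fed into a small fully connected network (FCNN)
# for genre classification. Paths, shapes, and hyperparameters are
# illustrative; the project used the GTZAN clips.
import numpy as np
import librosa
import tensorflow as tf

def mfcc_features(path, n_mfcc=13):
    """Load a clip and summarize each MFCC by its mean over time."""
    y, sr = librosa.load(path, duration=30.0)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
    return mfcc.mean(axis=1)                                # (n_mfcc,)

# X: (n_clips, 13) MFCC means; y: integer genre labels in [0, 10).
# Random stand-ins here so the sketch runs without audio files.
X = np.random.randn(100, 13).astype("float32")
y = np.random.randint(0, 10, size=100)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(13,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.3),  # regularization, as in variant (iii)
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=16)
```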

Awards and Fellowships

Hobbies

While at Stanford, I danced with Aleta Hayes’ Chocolate Heads company!

Before that, I danced flamenco at the National Institute of Flamenco for thirteen years. During my senior year of high school, I performed with the University of New Mexico’s dance company in “Elementos” choreographed by Adrian Santana.