SAS – PySpark equivalences

When implementing data science processes we can choose between different tools. Two of the most widely used are SAS and PySpark. Spark is an open source framework for distributed execution on a cluster, and PySpark is the library for working with it from Python. In previous posts we saw how to work with Python and SAS in an integrated way on the SAS Viya platform: https://online.datademy.es/videotutorial-integrar-sas-y-python-en-plataforma-sas-vita/

Below is a series of examples of data operations carried out with both tools. This list of examples is the first of a series that we will keep publishing.

Examples of dataset operations in SAS and their PySpark equivalents

Reading a file

We will work with the following file, CUSTOMER_CHURN_TELCO. An extract is shown below:

customerID;gender;tenure;MultipleLines;MonthlyCharges;StreamingTV;DeviceProtection;Contract;Churn
7590-VHVEG;Female;1;No phone service;29.85;No;No;Month-to-month;No
5575-GNVDE;Male;34;No;56.95;No;Yes;One year;No
3668-QPYBK;Male;2;No;53.85;No;No;Month-to-month;Yes
7795-CFOCW;Male;45;No phone service;;No;Yes;One year;No
9237-HQITU;Female;2;No;70.7;No;No;Month-to-month;Yes

Field descriptions:

  • customerID: customer identifier
  • gender: customer gender (Male/Female)
  • tenure: customer tenure in months
  • MultipleLines: whether the customer has multiple lines (Yes/No)
  • MonthlyCharges: the customer's monthly charges (bill amount)
  • StreamingTV: whether the customer has TV streaming services (Yes/No)
  • DeviceProtection: whether the customer has a device protection service (Yes/No)
  • Contract: contract type/duration (Two year, One year, Month-to-month)
  • Churn: whether the customer has churned (Yes/No)

The reading step in SAS is as follows:

DATA WORK.CUSTOMER1;
LENGTH   customerID    $10  gender  $6   tenure  8
    MultipleLines  $20 MonthlyCharges  8 StreamingTV $20  
    DeviceProtection $20 Contract  $20 Churn $3 ;
INFILE '/home/ficheros/CUSTOMER_CHURN_TELCO.txt'
    FIRSTOBS=2 DLM=';' DSD ;
INPUT
    customerID  gender tenure MultipleLines MonthlyCharges   
    StreamingTV  DeviceProtection Contract Churn  ;
 RUN;

We obtain the PySpark equivalent. First we list the files available in DBFS:

%fs ls dbfs:/FileStore/tables/

path | name | size
dbfs:/FileStore/tables/CUSTOMER_CHURN_TELCO.txt | CUSTOMER_CHURN_TELCO.txt | 432871
dbfs:/FileStore/tables/SALES.csv | SALES.csv | 60197
dbfs:/FileStore/tables/appl_stock.csv | appl_stock.csv | 143130
dbfs:/FileStore/tables/people.csv | people.csv | 34
dbfs:/FileStore/tables/ventas.txt | ventas.txt | 50
dbfs:/FileStore/tables/ventas_ciudad.txt | ventas_ciudad.txt | 57
dbfs:/FileStore/tables/ventas_ciudad2.txt | ventas_ciudad2.txt | 92

Showing all 7 rows.

The read method

df_customer1 = spark.read \
  .format("csv") \
  .option("header", "true") \
  .option("inferSchema", "true") \
  .option("delimiter", ";") \
  .load("dbfs:/FileStore/tables/CUSTOMER_CHURN_TELCO.txt")
df_customer1.show()

+----------+------+------+----------------+--------------+-------------------+-------------------+--------------+-----+
|customerID|gender|tenure|   MultipleLines|MonthlyCharges|        StreamingTV|   DeviceProtection|      Contract|Churn|
+----------+------+------+----------------+--------------+-------------------+-------------------+--------------+-----+
|7590-VHVEG|Female|     1|No phone service|         29.85|                 No|                 No|Month-to-month|   No|
|5575-GNVDE|  Male|    34|              No|         56.95|                 No|                Yes|      One year|   No|
|3668-QPYBK|  Male|     2|              No|         53.85|                 No|                 No|Month-to-month|  Yes|
|7795-CFOCW|  Male|    45|No phone service|          null|                 No|                Yes|      One year|   No|
|9237-HQITU|Female|     2|              No|          70.7|                 No|                 No|Month-to-month|  Yes|
|9305-CDSKC|Female|     8|             Yes|         99.65|                Yes|                Yes|Month-to-month|  Yes|
|9225-CDGGG|  Male|     1|             Yes|         52.65|                 No|                Yes|Month-to-month|  Yes|
|9625-CDAAG|Female|     2|             Yes|         32.75|                Yes|                Yes|Month-to-month|  Yes|
|9225-PPGGG|  Male|     1|             Yes|         31.65|                 No|                Yes|Month-to-month|  Yes|
|9275-CDGVV|Female|     2|             Yes|         34.65|                Yes|                Yes|Month-to-month|  Yes|
|9825-UUGGG|Female|     1|             Yes|         32.85|                Yes|                Yes|Month-to-month|  Yes|
|6225-CDGGT|  Male|     1|             Yes|         42.65|                Yes|                Yes|Month-to-month|  Yes|
|9385-AEUGG|  Male|    55|             Yes|         65.65|                Yes|                Yes|      Two year|   No|
|1452-KIOVK|  Male|    22|             Yes|          89.1|                Yes|                 No|Month-to-month|   No|
|6713-OKOMC|Female|    10|No phone service|         29.75|                 No|                 No|Month-to-month|   No|
|7892-POOKP|Female|    28|             Yes|         104.8|                Yes|                Yes|Month-to-month|  Yes|
|6388-TABGU|  Male|    62|              No|         56.15|                 No|                 No|      One year|   No|
|9763-GRSKD|  Male|    13|              No|         49.95|                 No|                 No|Month-to-month|   No|
|7469-LKBCI|  Male|    16|              No|         18.95|No internet service|No internet service|      Two year|   No|
|8091-TTVAX|  Male|    58|             Yes|        100.35|                Yes|                Yes|      One year|   No|
+----------+------+------+----------------+--------------+-------------------+-------------------+--------------+-----+
only showing top 20 rows
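The inferSchema option lets Spark guess the column types from the data. If you prefer to declare the types explicitly, closer in spirit to the SAS LENGTH statement, a schema can be supplied instead of inferSchema. This is a minimal sketch, not taken from the original post (the variable name schema_customer is illustrative):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

# Explicit schema mirroring the SAS LENGTH declarations (illustrative)
schema_customer = StructType([
    StructField("customerID", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("tenure", IntegerType(), True),
    StructField("MultipleLines", StringType(), True),
    StructField("MonthlyCharges", DoubleType(), True),
    StructField("StreamingTV", StringType(), True),
    StructField("DeviceProtection", StringType(), True),
    StructField("Contract", StringType(), True),
    StructField("Churn", StringType(), True),
])

df_customer1 = spark.read \
  .format("csv") \
  .option("header", "true") \
  .option("delimiter", ";") \
  .schema(schema_customer) \
  .load("dbfs:/FileStore/tables/CUSTOMER_CHURN_TELCO.txt")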

Filters

We apply filters to the dataset we have just read.

The filter step in SAS is as follows, using a DATA step:

DATA CUST_FIL;
SET CUSTOMER1;
WHERE Churn = 'Yes';
RUN;

Using PROC SQL:

PROC SQL;
CREATE TABLE CUST_FIL AS
(SELECT * FROM CUSTOMER1 WHERE Churn = 'Yes');
QUIT;

We obtain the PySpark equivalent using the where method:

df_cust_fil = df_customer1.where("Churn == 'Yes'")
df_cust_fil.show(5)

+----------+------+------+-------------+--------------+-----------+----------------+--------------+-----+
|customerID|gender|tenure|MultipleLines|MonthlyCharges|StreamingTV|DeviceProtection|      Contract|Churn|
+----------+------+------+-------------+--------------+-----------+----------------+--------------+-----+
|3668-QPYBK|  Male|     2|           No|         53.85|         No|              No|Month-to-month|  Yes|
|9237-HQITU|Female|     2|           No|          70.7|         No|              No|Month-to-month|  Yes|
|9305-CDSKC|Female|     8|          Yes|         99.65|        Yes|             Yes|Month-to-month|  Yes|
|9225-CDGGG|  Male|     1|          Yes|         52.65|         No|             Yes|Month-to-month|  Yes|
|9625-CDAAG|Female|     2|          Yes|         32.75|        Yes|             Yes|Month-to-month|  Yes|
+----------+------+------+-------------+--------------+-----------+----------------+--------------+-----+
only showing top 5 rows
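The filter method is an alias of where, and the same condition can also be written with column expressions instead of a SQL string. A small sketch, not part of the original post:

from pyspark.sql.functions import col

# Same filter expressed with a column object
df_cust_fil = df_customer1.filter(col("Churn") == "Yes")
df_cust_fil.show(5)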

In PySpark the filtering can also be done with SQL.
First we register the DataFrame as a table with the createOrReplaceTempView method and then query it with spark.sql:

df_customer1.createOrReplaceTempView("t_customer1")
df_cust_fil = spark.sql("SELECT * FROM t_customer1 where Churn = 'Yes'")
df_cust_fil.show(5)

+----------+------+------+-------------+--------------+-----------+----------------+--------------+-----+
|customerID|gender|tenure|MultipleLines|MonthlyCharges|StreamingTV|DeviceProtection|      Contract|Churn|
+----------+------+------+-------------+--------------+-----------+----------------+--------------+-----+
|3668-QPYBK|  Male|     2|           No|         53.85|         No|              No|Month-to-month|  Yes|
|9237-HQITU|Female|     2|           No|          70.7|         No|              No|Month-to-month|  Yes|
|9305-CDSKC|Female|     8|          Yes|         99.65|        Yes|             Yes|Month-to-month|  Yes|
|9225-CDGGG|  Male|     1|          Yes|         52.65|         No|             Yes|Month-to-month|  Yes|
|9625-CDAAG|Female|     2|          Yes|         32.75|        Yes|             Yes|Month-to-month|  Yes|
+----------+------+------+-------------+--------------+-----------+----------------+--------------+-----+
only showing top 5 rows

Projection

Column selection.

The projection step in SAS is as follows. Using KEEP, which lists the columns to keep:

DATA CUST_SEL (KEEP= customerID  gender tenure);
SET CUSTOMER1;
RUN;

Another option is the DROP statement, which lists the columns to remove:

DATA CUST_SEL (DROP= MultipleLines MonthlyCharges   
        StreamingTV  DeviceProtection Contract Churn);
SET CUSTOMER1;
RUN;

Using PROC SQL:

PROC SQL;
CREATE TABLE CUST_SEL AS
(SELECT customerID,  gender, tenure FROM CUSTOMER1 );
QUIT;

In PySpark we use the select method:

df_cust_sel = df_customer1.select("customerID", "gender", "tenure")
df_cust_sel.show(5)

+----------+------+------+
|customerID|gender|tenure|
+----------+------+------+
|7590-VHVEG|Female|     1|
|5575-GNVDE|  Male|    34|
|3668-QPYBK|  Male|     2|
|7795-CFOCW|  Male|    45|
|9237-HQITU|Female|     2|
+----------+------+------+
only showing top 5 rows

Another option is the drop method:

df_cust_sel = df_customer1.drop("MultipleLines","MonthlyCharges", "StreamingTV", "DeviceProtection", "Contract", "Churn")
df_cust_sel.show(5)

+----------+------+------+
|customerID|gender|tenure|
+----------+------+------+
|7590-VHVEG|Female|     1|
|5575-GNVDE|  Male|    34|
|3668-QPYBK|  Male|     2|
|7795-CFOCW|  Male|    45|
|9237-HQITU|Female|     2|
+----------+------+------+
only showing top 5 rows

Using SQL
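Using the temporary view t_customer1 registered earlier, the projection can also be written as a query. A minimal sketch along the lines of the previous SQL examples, not taken from the original post:

df_cust_sel = spark.sql("SELECT customerID, gender, tenure FROM t_customer1")
df_cust_sel.show(5)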

Computed columns

In SAS we can compute columns in a DATA step:

DATA CUSTOMER2;
SET CUSTOMER1;
TENURE_YEARS = TENURE/12;
RUN;

Or with PROC SQL:

PROC SQL;
CREATE TABLE CUSTOMER2 AS
(SELECT t.*, t.tenure/12 as tenure_years 
FROM CUSTOMER1 t );
QUIT;

In PySpark we use the SQL functions together with the withColumn method:

from pyspark.sql.functions import col
df_customer2 = df_customer1.withColumn("tenure_years", col("tenure")/12 )
df_customer2.show(5)

+----------+------+------+----------------+--------------+-----------+----------------+--------------+-----+-------------------+
|customerID|gender|tenure|   MultipleLines|MonthlyCharges|StreamingTV|DeviceProtection|      Contract|Churn|       tenure_years|
+----------+------+------+----------------+--------------+-----------+----------------+--------------+-----+-------------------+
|7590-VHVEG|Female|     1|No phone service|         29.85|         No|              No|Month-to-month|   No|0.08333333333333333|
|5575-GNVDE|  Male|    34|              No|         56.95|         No|             Yes|      One year|   No| 2.8333333333333335|
|3668-QPYBK|  Male|     2|              No|         53.85|         No|              No|Month-to-month|  Yes|0.16666666666666666|
|7795-CFOCW|  Male|    45|No phone service|          null|         No|             Yes|      One year|   No|               3.75|
|9237-HQITU|Female|     2|              No|          70.7|         No|              No|Month-to-month|  Yes|0.16666666666666666|
+----------+------+------+----------------+--------------+-----------+----------------+--------------+-----+-------------------+
only showing top 5 rows

Using SQL

df_customer2 = spark.sql("SELECT *, tenure/12 as tenure_years FROM t_customer1 ")
df_customer2.show(5)

+----------+------+------+----------------+--------------+-----------+----------------+--------------+-----+-------------------+
|customerID|gender|tenure|   MultipleLines|MonthlyCharges|StreamingTV|DeviceProtection|      Contract|Churn|       tenure_years|
+----------+------+------+----------------+--------------+-----------+----------------+--------------+-----+-------------------+
|7590-VHVEG|Female|     1|No phone service|         29.85|         No|              No|Month-to-month|   No|0.08333333333333333|
|5575-GNVDE|  Male|    34|              No|         56.95|         No|             Yes|      One year|   No| 2.8333333333333335|
|3668-QPYBK|  Male|     2|              No|         53.85|         No|              No|Month-to-month|  Yes|0.16666666666666666|
|7795-CFOCW|  Male|    45|No phone service|          null|         No|             Yes|      One year|   No|               3.75|
|9237-HQITU|Female|     2|              No|          70.7|         No|              No|Month-to-month|  Yes|0.16666666666666666|
+----------+------+------+----------------+--------------+-----------+----------------+--------------+-----+-------------------+
only showing top 5 rows
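As a side note, not part of the original post, the same computed column can also be obtained in a single call with selectExpr, which accepts SQL expressions directly on the DataFrame:

# "*" keeps every existing column, just like t.* in the PROC SQL version
df_customer2 = df_customer1.selectExpr("*", "tenure/12 as tenure_years")
df_customer2.show(5)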

Sorting

In SAS we sort with the PROC SORT procedure:

PROC SORT DATA=CUSTOMER1;
by customerID;
run;

In PySpark we use the sort method:

df_customer1.sort("customerID").show(5)

+----------+------+------+-------------+--------------+-----------+----------------+--------------+-----+
|customerID|gender|tenure|MultipleLines|MonthlyCharges|StreamingTV|DeviceProtection|      Contract|Churn|
+----------+------+------+-------------+--------------+-----------+----------------+--------------+-----+
|0002-ORFBO|Female|     9|           No|          65.6|        Yes|              No|      One year|   No|
|0003-MKNFE|  Male|     9|          Yes|          59.9|         No|              No|Month-to-month|   No|
|0004-TLHLJ|  Male|     4|           No|          73.9|         No|             Yes|Month-to-month|  Yes|
|0011-IGKFF|  Male|    13|           No|          98.0|        Yes|             Yes|Month-to-month|  Yes|
|0013-EXCHZ|Female|     3|           No|          83.9|        Yes|              No|Month-to-month|  Yes|
+----------+------+------+-------------+--------------+-----------+----------------+--------------+-----+
only showing top 5 rows
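PROC SORT also accepts the DESCENDING option. In PySpark the equivalent is to sort on a column expression with desc(). A short sketch, not from the original post:

from pyspark.sql.functions import col

# Equivalent to: PROC SORT DATA=CUSTOMER1; BY DESCENDING customerID; RUN;
df_customer1.sort(col("customerID").desc()).show(5)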

Concatenation: stacking datasets by rows (append/concatenate)

In SAS we can concatenate in a DATA step:

DATA CUST_TOT;
SET CUSTOMER1 CUST_FIL;
RUN;

Another option is the PROC APPEND procedure:

PROC APPEND BASE=CUSTOMER1 NEW=CUST_FIL;
RUN;

The concatenation step in PySpark uses the union method:

df_cust_tot = df_customer1.union(df_cust_fil)
df_cust_tot.show(5)

+----------+------+------+----------------+--------------+-----------+----------------+--------------+-----+
|customerID|gender|tenure|   MultipleLines|MonthlyCharges|StreamingTV|DeviceProtection|      Contract|Churn|
+----------+------+------+----------------+--------------+-----------+----------------+--------------+-----+
|7590-VHVEG|Female|     1|No phone service|         29.85|         No|              No|Month-to-month|   No|
|5575-GNVDE|  Male|    34|              No|         56.95|         No|             Yes|      One year|   No|
|3668-QPYBK|  Male|     2|              No|         53.85|         No|              No|Month-to-month|  Yes|
|7795-CFOCW|  Male|    45|No phone service|          null|         No|             Yes|      One year|   No|
|9237-HQITU|Female|     2|              No|          70.7|         No|              No|Month-to-month|  Yes|
+----------+------+------+----------------+--------------+-----------+----------------+--------------+-----+
only showing top 5 rows

df_cust_tot.count()

Out[11]: 9089
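One difference worth noting: the SAS SET statement matches variables by name, while union matches columns strictly by position. If the two DataFrames might have their columns in a different order, unionByName (available from Spark 2.3) is the safer equivalent. A sketch, not part of the original post:

# Matches columns by name instead of by position
df_cust_tot = df_customer1.unionByName(df_cust_fil)
df_cust_tot.count()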

Aggregations

In SAS we can aggregate with PROC SQL:

PROC SQL;
CREATE TABLE CUST_AG1 AS
(SELECT CHURN, COUNT(customerID) as numcust, avg(tenure) as avg_ten, 
sum(MonthlyCharges) as sum_mc
from CUSTOMER1
GROUP BY CHURN);
QUIT;

Another option is the PROC MEANS procedure:

PROC MEANS DATA = CUSTOMER1 NWAY;
CLASS Churn;
VAR tenure MonthlyCharges;
output out=CUST_AG(drop=_type_ _freq_)
  sum(MonthlyCharges) = sum_mc 
  mean(tenure) = avg_ten
  N = num_cust;
RUN;

In PySpark we use the groupBy and agg methods:

import pyspark.sql.functions as f

df_cust_ag = df_customer1.groupBy("Churn").agg(
    f.count("CustomerId").alias("num_cust"),
    f.avg("tenure").alias("avg_tenu"),
    f.sum("MonthlyCharges").alias("su_mc"))
df_cust_ag.show(5)

+-----+--------+------------------+------------------+
|Churn|num_cust|          avg_tenu|             su_mc|
+-----+--------+------------------+------------------+
|   No|    5203|40.915433403805494|329405.54999999877|
|  Yes|    1943| 9.902213072568193|133832.49999999983|
+-----+--------+------------------+------------------+
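The same aggregation can also be written in SQL against the temporary view t_customer1 registered earlier, which keeps it very close to the PROC SQL version. A sketch, not in the original post:

df_cust_ag = spark.sql("""
    SELECT Churn,
           count(customerID)   AS num_cust,
           avg(tenure)         AS avg_tenu,
           sum(MonthlyCharges) AS su_mc
    FROM t_customer1
    GROUP BY Churn
""")
df_cust_ag.show()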

Joins

In SAS we can perform the join in a DATA step with MERGE, sorting both datasets first:

PROC SORT DATA=CUSTOMER1;
by customerID;
run;

PROC SORT DATA=CITY;
by customerID;
run;

DATA CUSTOMER3;
MERGE CUSTOMER1(in=a) CITY(in=b);
by customerID;
if a;
run;

In PySpark we use the join method. Since the "if a;" condition in the MERGE keeps every observation of CUSTOMER1, the closest equivalent is a left join. We first build a small city DataFrame to join with:

df_city = spark.createDataFrame([
    ("7590-VHVEG", "Chicago"),
    ("5575-GNVDE", "Boston"),
    ("3668-QPYBK", "New York"),
    ("7795-CFOCW", "Washington")
  ]) \
  .toDF("id_customer", "city")
df_city.show()

+-----------+----------+
|id_customer|      city|
+-----------+----------+
| 7590-VHVEG|   Chicago|
| 5575-GNVDE|    Boston|
| 3668-QPYBK|  New York|
| 7795-CFOCW|Washington|
+-----------+----------+

df_customer3 = df_customer1.join(df_city, df_customer1["customerID"] == df_city["id_customer"], "left")
df_customer3.where("customerID in ('7590-VHVEG','5575-GNVDE')").show()

+----------+------+------+----------------+--------------+-----------+----------------+--------------+-----+-----------+-------+
|customerID|gender|tenure|   MultipleLines|MonthlyCharges|StreamingTV|DeviceProtection|      Contract|Churn|id_customer|   city|
+----------+------+------+----------------+--------------+-----------+----------------+--------------+-----+-----------+-------+
|5575-GNVDE|  Male|    34|              No|         56.95|         No|             Yes|      One year|   No| 5575-GNVDE| Boston|
|7590-VHVEG|Female|     1|No phone service|         29.85|         No|              No|Month-to-month|   No| 7590-VHVEG|Chicago|
+----------+------+------+----------------+--------------+-----------+----------------+--------------+-----+-----------+-------+
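Unlike the SAS MERGE, the PySpark join keeps both key columns (customerID and id_customer). If only one is wanted, it can be dropped after the join. A sketch using the same names as above, not from the original post:

# Keep a single key column after the join
df_customer3 = df_customer1.join(
    df_city,
    df_customer1["customerID"] == df_city["id_customer"],
    "left"
).drop("id_customer")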

Writing to a file

In SAS we can write the file out in a DATA step:

data _null_;
set CUSTOMER1;
file "/home/ficheros/fcustomer.csv" dlm = ';';
put customerID  gender tenure MultipleLines MonthlyCharges   
    StreamingTV  DeviceProtection Contract Churn;
run;

Or with the PROC EXPORT procedure:

proc export data=CUSTOMER1
 outfile="/home/ficheros/fcustomer.csv"  dbms=CSV;
run;

In PySpark we use the write method:

df_customer1.write \
  .format("csv") \
  .mode("overwrite") \
  .save("dbfs:/tmp/test/fcustomer")

The result is a directory of part files:

%fs ls /tmp/test/fcustomer

path | name | size
dbfs:/tmp/test/fcustomer/_SUCCESS | _SUCCESS | 0
dbfs:/tmp/test/fcustomer/_committed_2652053655009358260 | _committed_2652053655009358260 | 111
dbfs:/tmp/test/fcustomer/_started_2652053655009358260 | _started_2652053655009358260 | 0
dbfs:/tmp/test/fcustomer/part-00000-tid-2652053655009358260-4c54284e-9ade-40ed-8f13-abbc3f814490-2-1-c000.csv | part-00000-tid-2652053655009358260-4c54284e-9ade-40ed-8f13-abbc3f814490-2-1-c000.csv | 426420

Showing all 4 rows.
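If a single file with a header row and the same ';' delimiter is needed, closer to what the SAS code produces, the DataFrame can be reduced to one partition before writing. A sketch, not from the original post; bear in mind that coalesce(1) funnels all the data through a single task, so it is only advisable for small outputs:

df_customer1.coalesce(1).write \
  .format("csv") \
  .option("header", "true") \
  .option("delimiter", ";") \
  .mode("overwrite") \
  .save("dbfs:/tmp/test/fcustomer")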

Working with dates

In SAS we can create dates with the DATE() function, use date functions such as MONTH(), and apply display formats with FORMAT:

data CUSTOMER4;
set CUSTOMER1;
fc_act = DATE();
month_act = MONTH(fc_act);
format fc_act yymmdd10.;
run;
In PySpark we use the current_date and month functions:

from pyspark.sql.functions import current_date, current_timestamp, month
df_customer4 = df_customer1.withColumn("fc_act", current_date() ).withColumn("month_act", month("fc_act"))
df_customer4.show(5)

+----------+------+------+----------------+--------------+-----------+----------------+--------------+-----+----------+---------+
|customerID|gender|tenure|   MultipleLines|MonthlyCharges|StreamingTV|DeviceProtection|      Contract|Churn|    fc_act|month_act|
+----------+------+------+----------------+--------------+-----------+----------------+--------------+-----+----------+---------+
|7590-VHVEG|Female|     1|No phone service|         29.85|         No|              No|Month-to-month|   No|2021-01-26|        1|
|5575-GNVDE|  Male|    34|              No|         56.95|         No|             Yes|      One year|   No|2021-01-26|        1|
|3668-QPYBK|  Male|     2|              No|         53.85|         No|              No|Month-to-month|  Yes|2021-01-26|        1|
|7795-CFOCW|  Male|    45|No phone service|          null|         No|             Yes|      One year|   No|2021-01-26|        1|
|9237-HQITU|Female|     2|              No|          70.7|         No|              No|Month-to-month|  Yes|2021-01-26|        1|
+----------+------+------+----------------+--------------+-----------+----------------+--------------+-----+----------+---------+
only showing top 5 rows
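The SAS FORMAT statement only changes how the date is displayed. In PySpark the closest equivalent is date_format, which renders the date as a string in the requested pattern. A sketch, not part of the original post (the column name fc_act_txt is illustrative):

from pyspark.sql.functions import date_format

# Render fc_act with an explicit pattern, similar in spirit to FORMAT fc_act yymmdd10.;
df_customer4.withColumn("fc_act_txt", date_format("fc_act", "yyyy-MM-dd")).show(5)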


Database access (Hive)

In SAS we create a library that points to Hive; through that library we can then run DATA steps against Hive tables:

libname hivesas hadoop server=hadoop srv user='user'; 

data hivesas.t_customer;
set CUSTOMER1;
run;

Tables in the Hive metastore:

spark.sql("show tables").show()

+--------+--------------+-----------+
|database|     tableName|isTemporary|
+--------+--------------+-----------+
| default|appl_stock_csv|      false|
| default|    people_csv|      false|
| default|     sales_csv|      false|
| default|    sales_csv1|      false|
| default|     t_ventas1|      false|
+--------+--------------+-----------+

We write the table to Hive:

df_customer1.write.saveAsTable("t_customer")
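Once the table is saved, it can be read back from the metastore with spark.table or with a SQL query. A short sketch, not from the original post:

df_back = spark.table("t_customer")   # equivalent to spark.sql("SELECT * FROM t_customer")
df_back.show(5)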

We will keep publishing more equivalences. To learn more about the equivalences between both tools, or for training, consulting or advisory services, contact: formacion@datademy.es
